[
https://issues.apache.org/jira/browse/SPARK-54796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-54796:
-----------------------------------
Labels: pull-request-available (was: )
> NPE caused by race condition between Executor initialization and shuffle
> migration
> ----------------------------------------------------------------------------------
>
> Key: SPARK-54796
> URL: https://issues.apache.org/jira/browse/SPARK-54796
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.1.0
> Reporter: Tengfei Huang
> Priority: Major
> Labels: pull-request-available
>
> When an executor is decommissioned and its shuffle data needs to be migrated,
> the decommissioning side asks the master for peers (`getPeers`) to use as
> targets for the shuffle data migration.
> A blockManager becomes a candidate as soon as it is `initialized` and
> registered with the master. At that point the `Executor` may not be fully
> initialized yet, so `SparkEnv.shuffleManager` can still be null.
> In Executor, blockManager is initialized earlier than shuffleManager:
> [https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L163]
> [https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L371]
> If a shuffle migration request is handled in that window, the
> `lazy shuffleManager` in `BlockManager` is initialized as `null`, and later
> operations on `shuffleManager` throw a `NullPointerException`.
> Error logs:
> ```
> 2025-12-22 10:00:06 ERROR NettyBlockTransferService:110 - Error while uploading shuffle_0_0_0.data as stream
> java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "org.apache.spark.shuffle.ShuffleManager.shuffleBlockResolver()" because the return value of "org.apache.spark.storage.BlockManager.shuffleManager()" is null
>     at org.apache.spark.storage.BlockManager.migratableResolver$lzycompute(BlockManager.scala:314)
>     at org.apache.spark.storage.BlockManager.migratableResolver(BlockManager.scala:313)
>     at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:777)
>     at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:184)
>     at org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:208)
>     at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117)
>     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:143)
>     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:55)
>     at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
>     at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:293)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:354)
> ```
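
The failure mode described in the issue can be reproduced in isolation. The sketch below is not Spark code: `ShuffleManagerLike`, `EnvSketch`, `BlockManagerSketch` and `LazyNullRaceSketch` are hypothetical stand-ins, and the assumption is that the block manager reads the shuffle manager from a shared environment through a `lazy val`. It shows that a `lazy val` evaluated before the environment is populated caches `null`, so requests keep failing even after initialization completes.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
trait ShuffleManagerLike { def shuffleBlockResolver: String }

// Plays the role of the shared environment: the field is set late in startup.
object EnvSketch {
  @volatile var shuffleManager: ShuffleManagerLike = null
}

// Plays the role of a block manager that looks up the shuffle manager lazily.
class BlockManagerSketch {
  // Evaluated once, on first access. Whatever value it reads then is cached.
  lazy val shuffleManager: ShuffleManagerLike = EnvSketch.shuffleManager

  // Dereferences the cached value, so a cached null throws here.
  def putBlockDataAsStream(): String = shuffleManager.shuffleBlockResolver
}

object LazyNullRaceSketch {
  // Returns (first request failed, later request failed).
  def run(): (Boolean, Boolean) = {
    EnvSketch.shuffleManager = null   // executor has not finished initializing
    val bm = new BlockManagerSketch() // already "registered", so it can receive requests

    // A migration request arrives before the shuffle manager is set.
    val firstFailed =
      try { bm.putBlockDataAsStream(); false }
      catch { case _: NullPointerException => true }

    // Executor initialization completes afterwards.
    EnvSketch.shuffleManager = new ShuffleManagerLike {
      def shuffleBlockResolver: String = "resolver"
    }

    // The lazy val already cached null, so this request fails as well.
    val laterFailed =
      try { bm.putBlockDataAsStream(); false }
      catch { case _: NullPointerException => true }

    (firstFailed, laterFailed)
  }

  def main(args: Array[String]): Unit = {
    val (first, later) = run()
    println(s"first request failed: $first, later request failed: $later")
  }
}
```

Under these assumptions both requests fail, which matches the report that later operations on `shuffleManager` hit the NPE once the lazy value has been initialized as `null`.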