Tengfei Huang created SPARK-54796:
-------------------------------------
Summary: NPE caused by race condition between Executor
initialization and shuffle migration
Key: SPARK-54796
URL: https://issues.apache.org/jira/browse/SPARK-54796
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 4.1.0
Reporter: Tengfei Huang
When an executor is decommissioned and its shuffle data needs to be migrated, it
asks the master for peers (`getPeers`) to use as targets for the migration.
A blockManager becomes a candidate as soon as it is `initialized` and registered
with the master. At that point the `Executor` may not be fully initialized yet,
so `SparkEnv.shuffleManager` can still be null.
In Executor, we initialize blockManager earlier than shuffleManager:
[https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L163]
[https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L371]
If a shuffle migration request is handled in that window, the `lazy
shuffleManager` in `BlockManager` is initialized to `null`. Later operations on
`shuffleManager` then fail with a `NullPointerException`.
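The failure mode described above can be reproduced in isolation with a small Scala sketch. It is not Spark's code: `Env`, `BlockManagerSketch`, `RaceDemo`, and the simplified `ShuffleManager` trait are hypothetical stand-ins, and the way the real `BlockManager` reads `SparkEnv.shuffleManager` is assumed to follow the same lazy pattern. The sketch only illustrates that a Scala `lazy val` caches whatever its initializer returns, including `null`, so a request that arrives too early can leave the field null even after initialization completes.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark's classes.
trait ShuffleManager { def shuffleBlockResolver: String }

object Env {
  // Plays the role of SparkEnv.shuffleManager: null until the executor
  // has finished initializing.
  @volatile var shuffleManager: ShuffleManager = null
}

class BlockManagerSketch {
  // Evaluated once on first access. The result is cached even when it is null.
  lazy val shuffleManager: ShuffleManager = Env.shuffleManager

  // Dereferences shuffleManager, so it throws a NullPointerException
  // when the cached value is null.
  lazy val migratableResolver: String = shuffleManager.shuffleBlockResolver
}

object RaceDemo {
  def run(): (Boolean, Boolean) = {
    val bm = new BlockManagerSketch

    // A migration request arrives before the shuffle manager is set.
    val failedBeforeInit = scala.util.Try(bm.migratableResolver).isFailure

    // The executor now finishes initialization.
    Env.shuffleManager = new ShuffleManager {
      def shuffleBlockResolver = "resolver"
    }

    // The lazy val already cached null, so the same call fails again.
    val failedAfterInit = scala.util.Try(bm.migratableResolver).isFailure

    (failedBeforeInit, failedAfterInit)
  }
}
```

In this sketch `RaceDemo.run()` reports a failure both before and after `Env.shuffleManager` is assigned. Under these assumptions, a fix would either have to keep the block manager out of the peer list until the shuffle manager exists, or avoid caching a null value.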
Error logs:
```
2025-12-22 10:00:06 ERROR NettyBlockTransferService:110 - Error while uploading shuffle_0_0_0.data as stream
java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "org.apache.spark.shuffle.ShuffleManager.shuffleBlockResolver()" because the return value of "org.apache.spark.storage.BlockManager.shuffleManager()" is null
    at org.apache.spark.storage.BlockManager.migratableResolver$lzycompute(BlockManager.scala:314)
    at org.apache.spark.storage.BlockManager.migratableResolver(BlockManager.scala:313)
    at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:777)
    at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:184)
    at org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:208)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:143)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:55)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:293)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:354)
```