Tengfei Huang created SPARK-54796:
-------------------------------------

             Summary: NPE caused by race condition between Executor initialization and shuffle migration
                 Key: SPARK-54796
                 URL: https://issues.apache.org/jira/browse/SPARK-54796
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 4.1.0
            Reporter: Tengfei Huang


When an executor is decommissioned and its shuffle data needs to be migrated, 
the decommissioning executor asks the master via `getPeers` for the block 
managers to use as targets for the shuffle data migration.

A blockManager becomes a candidate as soon as it is `initialized` and 
registered with the master. At that point the `Executor` may not be fully 
initialized yet, and `SparkEnv.shuffleManager` may still be null.

In Executor, we initialize blockManager earlier than shuffleManager.

[https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L163]

[https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L371]

If a shuffle migration request is handled in that window, the lazy 
`shuffleManager` field in `BlockManager` is initialized to `null`. Later 
operations on `shuffleManager` then throw `NullPointerException`.
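
The failure mode can be illustrated with a minimal, self-contained sketch. It is 
not Spark code: `FakeEnv`, `FakeShuffleManager`, `FakeResolver`, 
`FakeBlockManager` and `LazyNullDemo` are hypothetical stand-ins, and the sketch 
assumes the lazy field simply reads the manager from the environment. It relies 
only on standard Scala behaviour: a `lazy val` whose initializer returns `null` 
caches that `null`, while a `lazy val` whose initializer throws is retried on 
the next access.

```scala
// Hypothetical stand-ins; none of these are Spark classes.
class FakeResolver
class FakeShuffleManager { val resolver = new FakeResolver }

// Mimics an environment whose shuffle manager is assigned late during startup.
class FakeEnv { @volatile var shuffleManager: FakeShuffleManager = null }

class FakeBlockManager(env: FakeEnv) {
  // Evaluated once on first access; the result is cached even when it is null.
  lazy val shuffleManager: FakeShuffleManager = env.shuffleManager
  // Throws NullPointerException for as long as the cached shuffleManager is null.
  lazy val migratableResolver: FakeResolver = shuffleManager.resolver
}

object LazyNullDemo {
  // Returns true if accessing migratableResolver throws a NullPointerException.
  def failsWithNpe(bm: FakeBlockManager): Boolean =
    try { bm.migratableResolver; false }
    catch { case _: NullPointerException => true }

  def main(args: Array[String]): Unit = {
    val env = new FakeEnv
    val bm = new FakeBlockManager(env)
    // A "migration request" arrives before the environment is fully set up.
    val early = failsWithNpe(bm)
    // The environment finishes initializing afterwards.
    env.shuffleManager = new FakeShuffleManager
    // The cached null is not re-read, so the same access still fails.
    val late = failsWithNpe(bm)
    println(s"NPE before init: $early, NPE after init: $late")
  }
}
```

In this sketch both accesses fail: the first one caches `null` in 
`shuffleManager`, so the later assignment to the environment is never observed.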




Error logs:
```

2025-12-22 10:00:06 ERROR NettyBlockTransferService:110 - Error while uploading shuffle_0_0_0.data as stream
java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "org.apache.spark.shuffle.ShuffleManager.shuffleBlockResolver()" because the return value of "org.apache.spark.storage.BlockManager.shuffleManager()" is null
        at org.apache.spark.storage.BlockManager.migratableResolver$lzycompute(BlockManager.scala:314)
        at org.apache.spark.storage.BlockManager.migratableResolver(BlockManager.scala:313)
        at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:777)
        at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:184)
        at org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:208)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:143)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:55)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:293)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:354)

```


