[ 
https://issues.apache.org/jira/browse/SPARK-54796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-54796:
-----------------------------------
    Labels: pull-request-available  (was: )

> NPE caused by race condition between Executor initialization and shuffle 
> migration
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-54796
>                 URL: https://issues.apache.org/jira/browse/SPARK-54796
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.1.0
>            Reporter: Tengfei Huang
>            Priority: Major
>              Labels: pull-request-available
>
> When an executor is decommissioned and its shuffle data needs to be migrated, 
> the decommissioning executor asks the master to `getPeers` to find targets for 
> the shuffle data migration.
> A blockManager becomes a candidate as soon as it has been `initialized` and 
> registered with the master. At that time the `Executor` may not be fully 
> initialized, so `SparkEnv.shuffleManager` may still be null.
> In Executor, we initialize blockManager earlier than shuffleManager.
> [https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L163]
> [https://github.com/apache/spark/blob/d0cbad56a10502a1c931d5967beeae2369f6fa15/core/src/main/scala/org/apache/spark/executor/Executor.scala#L371]
> If a shuffle migration request is handled in that window, the `lazy 
> shuffleManager` in `BlockManager` is initialized as `null`, and later 
> operations on `shuffleManager` throw a `NullPointerException`.
> Error logs:
> ```
> 2025-12-22 10:00:06 ERROR NettyBlockTransferService:110 - Error while uploading shuffle_0_0_0.data as stream
> java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "org.apache.spark.shuffle.ShuffleManager.shuffleBlockResolver()" because the return value of "org.apache.spark.storage.BlockManager.shuffleManager()" is null
>         at org.apache.spark.storage.BlockManager.migratableResolver$lzycompute(BlockManager.scala:314)
>         at org.apache.spark.storage.BlockManager.migratableResolver(BlockManager.scala:313)
>         at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:777)
>         at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:184)
>         at org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:208)
>         at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117)
>         at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:143)
>         at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:55)
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
>         at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:293)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:354)
> ```
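
The description attributes the failure to a `lazy` field in `BlockManager` being evaluated before `SparkEnv.shuffleManager` is set. The following is a minimal standalone sketch of that pattern, not Spark's actual code: `FakeEnv`, `FakeBlockManager`, `FakeShuffleManager`, `Resolver`, and `LazyNullDemo` are hypothetical stand-ins introduced only for illustration. It assumes the lazy field simply reads the value from the environment on first access.

```scala
// Hypothetical stand-ins for illustration; these are NOT Spark classes.
final class Resolver(val name: String)

final class FakeShuffleManager {
  val shuffleBlockResolver: Resolver = new Resolver("resolver")
}

// Mimics an environment whose shuffleManager is assigned after construction.
final class FakeEnv {
  @volatile var shuffleManager: FakeShuffleManager = null
}

// Mimics the described pattern: the lazy val captures whatever the
// environment holds at the moment of first access, including null.
final class FakeBlockManager(env: FakeEnv) {
  lazy val shuffleManager: FakeShuffleManager = env.shuffleManager

  def resolverName(): String = shuffleManager.shuffleBlockResolver.name
}

object LazyNullDemo {
  // Returns whether the call failed (before init, after init).
  def run(): (Boolean, Boolean) = {
    val env = new FakeEnv
    val bm = new FakeBlockManager(env)

    // A request arrives before the environment is ready: NPE.
    val failedBeforeInit = scala.util.Try(bm.resolverName()).isFailure

    // Initialization completes afterwards.
    env.shuffleManager = new FakeShuffleManager

    // The lazy val has already cached null, so the call still fails.
    val failedAfterInit = scala.util.Try(bm.resolverName()).isFailure

    (failedBeforeInit, failedAfterInit)
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints (true,true)
}
```

The point of the sketch is that a Scala `lazy val` whose initializer returns `null` without throwing is considered initialized, so the `null` persists even after the environment is fully set up.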



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
