[
https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chandni Singh updated SPARK-39647:
----------------------------------
Summary: Block push fails with java.lang.IllegalArgumentException: Active
local dirs list has not been updated by any executor registration even when the
NodeManager hasn't been restarted (was: Block push fails with
java.lang.IllegalArgumentException: Active local dirs list has not been updated
by any executor registration even when NodeManager hasn't been restarted)
> Block push fails with java.lang.IllegalArgumentException: Active local dirs
> list has not been updated by any executor registration even when the
> NodeManager hasn't been restarted
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-39647
> URL: https://issues.apache.org/jira/browse/SPARK-39647
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 3.2.0
> Reporter: Chandni Singh
> Priority: Major
>
> We saw these exceptions during block push:
> {code:java}
> 22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block
> shuffle_170_568_174, and will not retry (0 retries)
> org.apache.spark.network.shuffle.BlockPushException:
> !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException:
> Active local dirs list has not been updated by any executor registration
> at
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
> at
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
> at
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
> at
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
> at
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
> 22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174
> to BlockManagerId(, node-x, 7337, None) failed.
> {code}
> Note: The NodeManager on node-x (the node on which this exception was seen)
> was not restarted.
> This happened because the executor registers its block manager with
> {{BlockManagerMaster}} before it registers with the external shuffle service
> (ESS). In push-based shuffle, the driver selects a block manager as a merger
> for the shuffle push. However, the ESS on that node can merge blocks only
> after it has received the metadata about merged directories from a local
> executor, which is sent when that executor registers with the ESS. If the
> local executor's registration is delayed but the ESS host has already been
> picked as a merger, the ESS fails to merge the blocks pushed to it, which is
> what happened here.
> The local executor on node-x is executor 754, and its block manager
> registration happened at 13:28:11:
> {code:java}
> 22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has
> registered (new total is 1200)
> 22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager
> node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
> {code}
> The application was registered with the shuffle server on node-x at 13:29:40:
> {code:java}
> 2022-06-24 13:29:40,343 INFO
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active
> local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/,
> /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/,
> /grid/c/tmp/yarn/] for application application_1653753500486_3193550
> {code}
> The driver selected node-x as a merger after 13:28:11, and when the
> executors started pushing to it, all of those pushes failed until 13:29:40.
> We can fix this by having the executor register with the ESS before it
> registers the block manager with the {{BlockManagerMaster}}.
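
The ordering problem described above can be illustrated with a small toy model.
This is only a sketch under stated assumptions, not Spark code: the classes
ToyShuffleService and ToyDriver, the method earliestPushSucceeds, and the
application, host, and directory names are all hypothetical stand-ins. The only
thing taken from the report is the precondition message from the stack trace.
The model attempts a push at the earliest moment the driver could select the
host as a merger, once for each registration order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are real Spark classes.
final class RegistrationOrderSketch {

    /** Hypothetical stand-in for the external shuffle service (ESS) on one node. */
    static final class ToyShuffleService {
        private final Map<String, List<String>> localDirsByApp = new HashMap<>();

        // An executor registration delivers the application's local dirs.
        void registerExecutor(String appId, List<String> localDirs) {
            localDirsByApp.put(appId, localDirs);
        }

        // Merging a pushed block requires local dirs for the application,
        // mirroring the precondition message seen in the stack trace.
        void receivePushedBlock(String appId, String blockId) {
            if (!localDirsByApp.containsKey(appId)) {
                throw new IllegalArgumentException(
                    "Active local dirs list has not been updated by any executor registration");
            }
        }
    }

    /** Hypothetical stand-in for the driver's view of registered block managers. */
    static final class ToyDriver {
        private final List<String> blockManagerHosts = new ArrayList<>();

        void registerBlockManager(String host) {
            blockManagerHosts.add(host);
        }

        boolean canSelectAsMerger(String host) {
            return blockManagerHosts.contains(host);
        }
    }

    /**
     * Runs the two executor startup steps in the chosen order and attempts a
     * push as soon as the host becomes selectable as a merger.
     * Returns true if that earliest possible push is accepted by the toy ESS.
     */
    static boolean earliestPushSucceeds(boolean registerWithEssFirst) {
        String appId = "app-1";   // hypothetical application id
        String host = "node-x";
        ToyShuffleService ess = new ToyShuffleService();
        ToyDriver driver = new ToyDriver();

        Runnable registerWithEss =
            () -> ess.registerExecutor(appId, List.of("/tmp/dir-a", "/tmp/dir-b"));
        Runnable registerBlockManager = () -> driver.registerBlockManager(host);

        List<Runnable> startupSteps = registerWithEssFirst
            ? List.of(registerWithEss, registerBlockManager)
            : List.of(registerBlockManager, registerWithEss);

        for (Runnable step : startupSteps) {
            step.run();
            if (driver.canSelectAsMerger(host)) {
                try {
                    ess.receivePushedBlock(appId, "shuffle_0_0_0");
                    return true;
                } catch (IllegalArgumentException e) {
                    return false;
                }
            }
        }
        return false; // host never became selectable as a merger
    }

    public static void main(String[] args) {
        System.out.println("block manager first: push succeeds = "
            + earliestPushSucceeds(false));
        System.out.println("ESS first: push succeeds = "
            + earliestPushSucceeds(true));
    }
}
```

In this model, registering the block manager first leaves a window in which the
host is selectable as a merger but the toy ESS has no local dirs, so the push is
rejected. Registering with the ESS first closes that window, which is the
ordering proposed in the issue.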