Chandni Singh created SPARK-39647:
-------------------------------------
Summary: Block push fails with java.lang.IllegalArgumentException:
Active local dirs list has not been updated by any executor registration even
when NodeManager hasn't been restarted
Key: SPARK-39647
URL: https://issues.apache.org/jira/browse/SPARK-39647
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 3.2.0
Reporter: Chandni Singh
We saw these exceptions during block push:
{code:java}
22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block
shuffle_170_568_174, and will not retry (0 retries)
org.apache.spark.network.shuffle.BlockPushException:
!application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException:
Active local dirs list has not been updated by any executor registration
at
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
at
org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
at
org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
at
org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
at
org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174
to BlockManagerId(, node-x, 7337, None) failed.
{code}
Note: The NodeManager on node-x (the node against which this exception was
seen) was not restarted.
This happened because the executor registers its block manager with
{{BlockManagerMaster}} before it registers with the external shuffle service
(ESS). In push-based shuffle, the driver selects block managers as mergers for
the shuffle push. However, the ESS on a selected node can merge pushed blocks
only after it has received the metadata about the merge directories from a
local executor (sent when that executor registers with the ESS). If this local
executor registration is delayed but the ESS host has already been picked as a
merger, the ESS fails to merge the blocks pushed to it, which is what happened
here.
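The merger-side failure can be modeled with a short sketch. This is a toy model under stated assumptions, not the actual {{RemoteBlockPushResolver}} code: the class {{ToyShuffleService}}, its methods, and its map of directories are hypothetical names chosen for illustration. Only the exception message is taken from the trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the merger-side check; NOT the real RemoteBlockPushResolver. */
final class ToyShuffleService {
    // appId -> merge directories reported by a local executor (hypothetical structure)
    private final Map<String, List<String>> activeLocalDirs = new ConcurrentHashMap<>();

    /** Models a local executor registering with the shuffle service. */
    void registerExecutor(String appId, List<String> localDirs) {
        activeLocalDirs.put(appId, new ArrayList<>(localDirs));
    }

    /** Models a remote executor pushing a block to this node for merging. */
    String pushBlock(String appId, String blockId) {
        List<String> dirs = activeLocalDirs.get(appId);
        if (dirs == null || dirs.isEmpty()) {
            // No local executor has registered yet, so there is nowhere to write.
            throw new IllegalArgumentException(
                "Active local dirs list has not been updated by any executor registration");
        }
        // Pick a directory from the block id (illustrative choice only).
        String dir = dirs.get(Math.floorMod(blockId.hashCode(), dirs.size()));
        return dir + blockId;
    }
}
```

In this model, any push that arrives between the driver selecting the node as a merger and the first local {{registerExecutor}} call is rejected, which matches the window seen in the logs.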
The local executor on node-x is executor 754, and its block manager
registration happened at 13:28:11:
{code:java}
22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has
registered (new total is 1200)
22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager
node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
{code}
The application was registered with the shuffle server on node-x at 13:29:40:
{code:java}
2022-06-24 13:29:40,343 INFO
org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active
local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/,
/grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/,
/grid/c/tmp/yarn/] for application application_1653753500486_3193550
{code}
node-x was selected as a merger by the driver after 13:28:11, and when the
executors started pushing to it, all of those pushes failed until 13:29:40.
We can fix this by having the executor register with the ESS before it
registers the block manager with the {{BlockManagerMaster}}.
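The effect of the proposed ordering can be sketched as follows. This is a minimal illustration, assuming the driver only considers hosts whose block manager has registered. The class {{StartupOrderSketch}} and all of its members are hypothetical and do not correspond to actual Spark classes or to the eventual patch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of executor startup ordering; names are hypothetical, not Spark APIs. */
final class StartupOrderSketch {
    /** Hosts whose shuffle service has received merge dirs from a local executor. */
    final Set<String> essReadyHosts = new HashSet<>();
    /** Hosts the driver knows about and may therefore select as mergers. */
    final List<String> mergerCandidates = new ArrayList<>();

    void registerWithShuffleService(String host) {
        essReadyHosts.add(host);
    }

    void registerBlockManager(String host) {
        mergerCandidates.add(host);
    }

    /** Proposed order: shuffle service first, then BlockManagerMaster. */
    void startExecutor(String host) {
        registerWithShuffleService(host);
        registerBlockManager(host);
    }

    /** True if every host the driver could pick can already merge pushed blocks. */
    boolean allCandidatesCanMerge() {
        return essReadyHosts.containsAll(mergerCandidates);
    }
}
```

With this order, a host becomes visible to the driver only after its shuffle service knows the merge directories, so the failing window described above should not open in this model.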