[ 
https://issues.apache.org/jira/browse/SPARK-42784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fencheng Mei updated SPARK-42784:
---------------------------------
    Description: 
After enabling push-based shuffle widely in our production environment, we 
found warning messages in the server-side logs.

The warnings look like this:

ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to 
BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) 
failed.
java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize 
merged shuffle partition for appId application_1671244879475_44020960 shuffleId 
3 shuffleMergeId 0 reduceId 935.

After investigation, we identified what triggers the bug.

The driver requested two different containers on the same physical machine. 
While the first container (container_1) was creating the 'push-merged' 
directory, it created the mergeDir first and then began creating the subDirs, 
whose number is set by the "spark.diskStore.subDirectories" parameter. However, 
container_1 was preempted while it was creating the sub-directories, so only 
some of the subDirs were created. Because the mergeDir already existed, the 
second container (container_2) assumed that all directories had already been 
created and did not create the missing subDirs.
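
The failure mode described above can be reproduced outside Spark with a small 
sketch. This is not Spark's actual code: the class name, the method names and 
the two-hex-digit sub-directory naming are assumptions made for illustration. 
It contrasts a check-then-create routine, which skips all work once the 
mergeDir exists, with an idempotent routine that ensures every subDir on each 
call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names come from the Spark code base.
public class MergeDirSketch {

    // Problematic pattern: an existing mergeDir is taken as proof that
    // every sub-directory was created.
    static void createIfMergeDirMissing(Path mergeDir, int subDirs) throws IOException {
        if (Files.exists(mergeDir)) {
            return; // a partially initialized mergeDir is never repaired
        }
        Files.createDirectories(mergeDir);
        for (int i = 0; i < subDirs; i++) {
            Files.createDirectories(mergeDir.resolve(String.format("%02x", i)));
        }
    }

    // Idempotent pattern: ensure each sub-directory on every call.
    // Files.createDirectories does not fail if the directory already exists.
    static void ensureAllSubDirs(Path mergeDir, int subDirs) throws IOException {
        Files.createDirectories(mergeDir);
        for (int i = 0; i < subDirs; i++) {
            Files.createDirectories(mergeDir.resolve(String.format("%02x", i)));
        }
    }

    // Counts the direct child directories of mergeDir.
    static long countSubDirs(Path mergeDir) throws IOException {
        try (Stream<Path> children = Files.list(mergeDir)) {
            return children.filter(Files::isDirectory).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("push-merged-sketch");
        Path mergeDir = root.resolve("push-merged");
        int subDirs = 64; // assumed value of spark.diskStore.subDirectories

        // Simulate container_1 being preempted after creating 3 of 64 subDirs.
        Files.createDirectories(mergeDir);
        for (int i = 0; i < 3; i++) {
            Files.createDirectories(mergeDir.resolve(String.format("%02x", i)));
        }

        // container_2 with the check-then-create pattern: nothing is repaired.
        createIfMergeDirMissing(mergeDir, subDirs);
        System.out.println("after check-then-create: " + countSubDirs(mergeDir)); // 3

        // container_2 with the idempotent pattern: the missing subDirs appear.
        ensureAllSubDirs(mergeDir, subDirs);
        System.out.println("after idempotent create: " + countSubDirs(mergeDir)); // 64
    }
}
```

Whether the real fix should re-create the missing subDirs on every start or 
clean up the partial mergeDir first is a design choice this sketch does not 
settle; it only shows why the existence of the mergeDir alone is not a 
reliable completion marker.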

 

  was:
After enabling push-based shuffle widely in our production environment, we 
found warning messages in the server-side logs, for example: 
23/02/21 11:17:27 WARN shuffle-client-7-1 ShuffleBlockPusher: Pushing block 
shufflePush_3_0_5352_935 to BlockManagerId(shuffle-push-merger, 
zw06-data-hdp-dn08251.mt, 7337, None) failed.
java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize 
merged shuffle partition for appId application_1671244879475_44020960 shuffleId 
3 shuffleMergeId 0 reduceId 935

 


> Fix the problem of incomplete creation of subdirectories in push merged 
> localDir
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-42784
>                 URL: https://issues.apache.org/jira/browse/SPARK-42784
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.3.2
>            Reporter: Fencheng Mei
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
