[ 
https://issues.apache.org/jira/browse/SPARK-38965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-38965:
----------------------------
    Description: 
For the push-based shuffle service, many 
{{BLOCK_APPEND_COLLISION_DETECTED}} errors occur when there are many small map 
task outputs. In {{RemoteBlockPushResolver}}, while the pushed blocks of one map 
task are being written, the pushed blocks of the other map tasks fail in the 
{{onComplete()}} method.
{{RemoteBlockPushResolver}} also has no memory limit, so many executors will 
OOM when many small pushed blocks are waiting to be written to the final 
data file.
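
The collision described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual {{RemoteBlockPushResolver}} code: the class {{AppendCollisionSketch}}, its fields, and its method signatures are hypothetical. It assumes one writer per merged partition at a time, with chunks from other map tasks held in an unbounded in-memory buffer until their stream completes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of one merged shuffle partition.
// Names and behavior are assumptions for illustration only.
class AppendCollisionSketch {
    private static final int NO_WRITER = -1;

    // Map index of the task currently appending to the merged file.
    private int currentWriter = NO_WRITER;
    // Chunks from tasks that could not write yet; note there is no size limit.
    private final Map<Integer, List<String>> deferred = new HashMap<>();
    private final List<String> mergedFile = new ArrayList<>();

    void onData(int mapIndex, String chunk) {
        if (currentWriter == NO_WRITER) {
            currentWriter = mapIndex; // first arrival takes the writer slot
        }
        if (currentWriter == mapIndex) {
            mergedFile.add(chunk);
        } else {
            // Another task is writing: hold the chunk in memory.
            deferred.computeIfAbsent(mapIndex, k -> new ArrayList<>()).add(chunk);
        }
    }

    // Returns true if the block was merged, false on an append collision.
    boolean onComplete(int mapIndex) {
        if (currentWriter == mapIndex) {
            currentWriter = NO_WRITER; // release the writer slot
            return true;
        }
        List<String> pending = deferred.remove(mapIndex);
        if (currentWriter == NO_WRITER) {
            if (pending != null) {
                mergedFile.addAll(pending); // slot is free, flush deferred chunks
            }
            return true;
        }
        return false; // someone else is still writing: collision, chunks dropped
    }

    int deferredChunkCount() {
        int total = 0;
        for (List<String> chunks : deferred.values()) {
            total += chunks.size();
        }
        return total;
    }

    List<String> mergedChunks() {
        return mergedFile;
    }
}
```

In this model, many small blocks arriving at once mean most of them sit in {{deferred}} and then fail in {{onComplete()}}, which is the pattern reported above.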

  was:
We should retry transferring blocks if *errorHandler.shouldRetryError(e)* returns 
true, even if the exception is not an IOException, for example:
{code:java}
org.apache.spark.network.server.BlockPushNonFatalFailure: Block 
shufflePush_0_0_3316_5647 experienced merge collision on the server side
{code}
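
A minimal sketch of the retry decision proposed here, under stated assumptions: the class {{RetryDecisionSketch}}, the nested {{ErrorHandler}} interface, and the {{retryCount}} / {{maxRetries}} parameters are hypothetical stand-ins, not Spark's actual classes or signatures. The point it shows is that the handler's answer decides the retry, regardless of whether the exception is an IOException.

```java
import java.io.IOException;

// Hypothetical sketch; not Spark's actual retry code.
class RetryDecisionSketch {

    // Stand-in for the error handler referred to above (assumed shape).
    interface ErrorHandler {
        boolean shouldRetryError(Throwable t);
    }

    // Retry whenever attempts remain and the handler approves the error,
    // without first requiring the exception to be an IOException.
    static boolean shouldRetry(Throwable e, int retryCount, int maxRetries,
                               ErrorHandler handler) {
        boolean hasRemainingRetries = retryCount < maxRetries;
        return hasRemainingRetries && handler.shouldRetryError(e);
    }

    public static void main(String[] args) {
        // Assumed handler: retry on anything that mentions a merge collision.
        ErrorHandler handler = t ->
            t.getMessage() != null && t.getMessage().contains("merge collision");

        Throwable collision = new RuntimeException(
            "Block experienced merge collision on the server side");
        Throwable io = new IOException("connection reset");

        System.out.println("collision retried: " + shouldRetry(collision, 0, 3, handler));
        System.out.println("io retried: " + shouldRetry(io, 0, 3, handler));
    }
}
```

Whether IOExceptions that the handler rejects should still be retried is a separate design choice; this sketch leaves that decision entirely to the handler.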


> Retry transfer blocks for exceptions listed in the error handler 
> -----------------------------------------------------------------
>
>                 Key: SPARK-38965
>                 URL: https://issues.apache.org/jira/browse/SPARK-38965
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Priority: Minor
>
> For the push-based shuffle service, many 
> {{BLOCK_APPEND_COLLISION_DETECTED}} errors occur when there are many small map 
> task outputs. In {{RemoteBlockPushResolver}}, while the pushed blocks of one map 
> task are being written, the pushed blocks of the other map tasks fail in the 
> {{onComplete()}} method.
> {{RemoteBlockPushResolver}} also has no memory limit, so many executors will 
> OOM when many small pushed blocks are waiting to be written to the 
> final data file.



