[
https://issues.apache.org/jira/browse/SPARK-38965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wan Kun updated SPARK-38965:
----------------------------
Description:
For the push-based shuffle service, many
{{BLOCK_APPEND_COLLISION_DETECTED}} errors occur when there are many small map
task outputs. In {{RemoteBlockPushResolver}}, while the pushed block of one map
task is being written, the pushed blocks of other map tasks fail in the
{{onComplete()}} method.
In addition, {{RemoteBlockPushResolver}} has no memory limit, so many executors
will run out of memory (OOM) when many small pushed blocks are waiting to be
written to the final data file.
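The collision described above can be illustrated with a simplified model. This is only a sketch under stated assumptions, not the actual {{RemoteBlockPushResolver}} code: the class {{MergedPartitionSketch}}, its fields, and the boolean return value are hypothetical, and a {{false}} result stands in for {{BLOCK_APPEND_COLLISION_DETECTED}}.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of one merged shuffle partition.
// It is NOT the real RemoteBlockPushResolver; it only shows why a second
// map task's block fails while another map task holds the write slot.
final class MergedPartitionSketch {
    private static final int NO_WRITER = -1;

    // Map index of the task currently allowed to append, or NO_WRITER.
    private int currentMapIndex = NO_WRITER;

    // Stand-in for the final data file: appended blocks in order.
    private final List<byte[]> dataFile = new ArrayList<>();

    // A pushed block starts arriving; the first arrival takes the write slot.
    synchronized void onStart(int mapIndex) {
        if (currentMapIndex == NO_WRITER) {
            currentMapIndex = mapIndex;
        }
    }

    // Returns true if the block was appended. Returns false when another
    // map task holds the write slot (models the append collision).
    synchronized boolean onComplete(int mapIndex, byte[] block) {
        if (currentMapIndex != NO_WRITER && currentMapIndex != mapIndex) {
            return false;
        }
        dataFile.add(block);
        currentMapIndex = NO_WRITER; // release the slot for the next writer
        return true;
    }

    synchronized int appendedBlocks() {
        return dataFile.size();
    }
}
```

In this model, a block rejected in {{onComplete()}} can be appended by pushing it again once the current writer has finished, which is why small, frequent pushes tend to collide.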
was:
We should retry transferring blocks if *errorHandler.shouldRetryError(e)*
returns true, even when the exception is not an IOException, for example:
{code:java}
org.apache.spark.network.server.BlockPushNonFatalFailure: Block
shufflePush_0_0_3316_5647 experienced merge collision on the server side
{code}
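The retry rule proposed in this earlier description can be sketched as follows. This is a minimal illustration, not Spark's implementation: the {{ErrorHandler}} interface and the {{RetryPolicySketch}} class are hypothetical stand-ins, only the {{shouldRetryError}} name comes from the text above, and the IOException-only check is an assumption inferred from that text.

```java
import java.io.IOException;

// Hypothetical stand-in for the error handler named in the description.
interface ErrorHandler {
    boolean shouldRetryError(Throwable t);
}

// Hypothetical retry policy contrasting the two decision rules.
final class RetryPolicySketch {
    private final ErrorHandler errorHandler;
    private final int maxRetries;
    private int retryCount = 0;

    RetryPolicySketch(ErrorHandler errorHandler, int maxRetries) {
        this.errorHandler = errorHandler;
        this.maxRetries = maxRetries;
    }

    // Assumed existing rule: only IOExceptions are eligible for retry.
    boolean shouldRetryIoOnly(Throwable e) {
        return e instanceof IOException
            && retryCount < maxRetries
            && errorHandler.shouldRetryError(e);
    }

    // Proposed rule: any exception the error handler accepts is retried.
    boolean shouldRetryProposed(Throwable e) {
        return retryCount < maxRetries && errorHandler.shouldRetryError(e);
    }

    // Call once per retry attempt so the retry budget is respected.
    void recordRetry() {
        retryCount++;
    }
}
```

Under the proposed rule, a non-IOException such as the merge-collision failure quoted above becomes retryable as long as the error handler accepts it and retries remain.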
> Retry transfer blocks for exceptions listed in the error handler
> -----------------------------------------------------------------
>
> Key: SPARK-38965
> URL: https://issues.apache.org/jira/browse/SPARK-38965
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 3.3.0
> Reporter: Wan Kun
> Priority: Minor
>
> For the push-based shuffle service, many
> {{BLOCK_APPEND_COLLISION_DETECTED}} errors occur when there are many small map
> task outputs. In {{RemoteBlockPushResolver}}, while the pushed block of one map
> task is being written, the pushed blocks of other map tasks fail in the
> {{onComplete()}} method.
> In addition, {{RemoteBlockPushResolver}} has no memory limit, so many executors
> will run out of memory (OOM) when many small pushed blocks are waiting to be
> written to the final data file.