[ https://issues.apache.org/jira/browse/SPARK-33747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

weixiuli updated SPARK-33747:
-----------------------------
    Description: When a fetch failure happens, DAGScheduler tries to
unregister the corresponding map output. The current logic has a race
condition: a new map stage attempt may be running while the current reduce
stage attempt reports another fetch failure. (Note: the current reduce stage
first reports a fetch failure, which causes the map stage to be rerun. The
rerunning map stage may then register a new MapStatus for the failed MapId
before the current reduce stage reports another fetch failure for the same
MapId. The current reduce stage attempt is still the latest attempt because
the new map stage attempt has not yet completed.) In this case, if the map
output is always unregistered, the scheduler may actually unregister the map
output produced by the new map stage attempt.  (was: When a fetch failure
happens, DAGScheduler tries to unregister the corresponding map output. The
current logic has a race condition: a new map stage attempt may be running
while the old reduce stage attempt reports another fetch failure. In this
case, if the map output is always unregistered, the scheduler may actually
unregister the map output from the new map stage attempt.)

> Avoid calling unregisterMapOutput when the map stage is being rerun.
> --------------------------------------------------------------------
>
>                 Key: SPARK-33747
>                 URL: https://issues.apache.org/jira/browse/SPARK-33747
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.4.5, 3.0.1
>            Reporter: weixiuli
>            Priority: Major
>             Fix For: 2.4.5, 3.0.1
>
>
> When a fetch failure happens, DAGScheduler tries to unregister the
> corresponding map output. The current logic has a race condition: a new map
> stage attempt may be running while the current reduce stage attempt reports
> another fetch failure. (Note: the current reduce stage first reports a fetch
> failure, which causes the map stage to be rerun. The rerunning map stage may
> then register a new MapStatus for the failed MapId before the current reduce
> stage reports another fetch failure for the same MapId. The current reduce
> stage attempt is still the latest attempt because the new map stage attempt
> has not yet completed.) In this case, if the map output is always
> unregistered, the scheduler may actually unregister the map output produced
> by the new map stage attempt.
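
The ordering of events described above can be replayed with a small toy model.
This is only a sketch, not Spark's DAGScheduler or MapOutputTracker code: the
class ToyScheduler, its methods, the "guard" flag, and the assumption that the
rerun map task registers its output at the same location ("exec-1") are all
hypothetical and chosen for illustration. The guard simply skips the
unregister step while the map stage is being rerun, as the issue title
proposes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler state; not a Spark API."""
    # (shuffle_id, map_index) -> location of the registered map output
    map_outputs: Dict[Tuple[int, int], str] = field(default_factory=dict)
    # shuffle ids whose map stage currently has a rerun in progress
    rerunning: Set[int] = field(default_factory=set)

    def register_map_output(self, shuffle_id: int, map_index: int, location: str) -> None:
        self.map_outputs[(shuffle_id, map_index)] = location

    def unregister_map_output(self, shuffle_id: int, map_index: int, location: str) -> None:
        # Remove the entry only if it still points at the failed location.
        if self.map_outputs.get((shuffle_id, map_index)) == location:
            del self.map_outputs[(shuffle_id, map_index)]

    def on_fetch_failed(self, shuffle_id: int, map_index: int, location: str, guard: bool) -> None:
        if guard and shuffle_id in self.rerunning:
            # Map stage is already being rerun: leave its outputs alone.
            return
        self.unregister_map_output(shuffle_id, map_index, location)
        self.rerunning.add(shuffle_id)  # models resubmitting the map stage


def replay(guard: bool) -> Optional[str]:
    """Replay the race and return the location registered for map 3, if any."""
    s = ToyScheduler()
    s.register_map_output(0, 3, "exec-1")            # original map output
    s.on_fetch_failed(0, 3, "exec-1", guard)         # first fetch failure
    s.register_map_output(0, 3, "exec-1")            # rerun map task finishes
    s.on_fetch_failed(0, 3, "exec-1", guard)         # stale second failure
    return s.map_outputs.get((0, 3))


if __name__ == "__main__":
    print(replay(guard=False))  # None: the fresh output was unregistered
    print(replay(guard=True))   # exec-1: the fresh output survives
```

Without the guard, the second (stale) fetch failure removes the output that
the rerun map task has just registered. With the guard, that output is kept.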



