Udit Mehrotra created SPARK-20115:
-------------------------------------

             Summary: Fix DAGScheduler to recompute all the lost shuffle blocks 
when external shuffle service is unavailable
                 Key: SPARK-20115
                 URL: https://issues.apache.org/jira/browse/SPARK-20115
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core, YARN
    Affects Versions: 2.1.0, 2.0.2
         Environment: Spark on YARN with the external shuffle service enabled, 
running on an AWS EMR cluster.
            Reporter: Udit Mehrotra


Spark's DAGScheduler currently does not recompute all the lost shuffle 
blocks on a host when a FetchFailed exception occurs while fetching shuffle 
blocks from another executor with the external shuffle service enabled. 
Instead, it recomputes only the shuffle blocks produced by the executor for 
which the FetchFailed exception occurred. This is correct for the internal 
shuffle scenario, where each executor serves its own shuffle blocks, so only 
that executor's blocks should be considered lost. However, when the External 
Shuffle Service is being used, a FetchFailed exception means that the 
external shuffle service running on that host has become unavailable, which 
is sufficient to assume that all the shuffle blocks managed by the shuffle 
service on that host are lost. Therefore, recomputing only the shuffle 
blocks associated with the particular executor for which the FetchFailed 
exception occurred is not sufficient. All the shuffle blocks managed by that 
service need to be recomputed, because multiple executors may have been 
running on that host.
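The distinction can be sketched with a toy map-output registry. This is a 
hypothetical illustration, not Spark's actual DAGScheduler/MapOutputTracker 
code; all class and method names here are made up for the example:

```java
import java.util.*;

// Toy registry contrasting per-executor eviction (current behaviour) with
// the host-wide eviction this issue argues for. Hypothetical names.
public class ShuffleEvictionSketch {
    // "host/executorId" -> shuffle block ids registered at that location
    static final Map<String, Set<String>> outputs = new LinkedHashMap<>();

    static void register(String host, String exec, String... blocks) {
        outputs.put(host + "/" + exec, new TreeSet<>(Arrays.asList(blocks)));
    }

    // Current behaviour: forget only the failing executor's blocks.
    static Set<String> evictExecutor(String host, String exec) {
        Set<String> lost = outputs.remove(host + "/" + exec);
        return lost == null ? new TreeSet<>() : lost;
    }

    // Behaviour requested here: when the external shuffle service on `host`
    // dies, forget every executor's blocks on that host.
    static Set<String> evictHost(String host) {
        Set<String> lost = new TreeSet<>();
        Iterator<Map.Entry<String, Set<String>>> it = outputs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> e = it.next();
            if (e.getKey().startsWith(host + "/")) {
                lost.addAll(e.getValue());
                it.remove();
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        register("host-1", "exec-1", "shuffle_0_0", "shuffle_0_1");
        register("host-1", "exec-2", "shuffle_0_2");
        register("host-2", "exec-3", "shuffle_0_3");
        // Per-executor eviction misses exec-2's block, which is also gone
        // because the shuffle service on host-1 served it too.
        System.out.println("executor eviction: " + evictExecutor("host-1", "exec-1"));
        // Host-wide eviction also catches exec-2's stale block.
        System.out.println("host eviction: " + evictHost("host-1"));
    }
}
```

With per-executor eviction, shuffle_0_2 stays registered at the dead host 
and reducers will keep trying to fetch it from there.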
 
Since not all the shuffle blocks (for all the executors on the host) are 
recomputed, subsequent attempts of the reduce stage fail as well: the newly 
scheduled tasks keep trying to fetch from the old location of the shuffle 
blocks that were never recomputed, and keep throwing further FetchFailed 
exceptions. This ultimately causes the job to fail after the reduce stage 
has been retried 4 times.
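The failure mode above can be sketched as a retry loop. This is a 
hypothetical simulation, not Spark code; the retry limit of 4 is taken from 
the behaviour observed in this report:

```java
// Sketch of why the job eventually aborts: every retry of the reduce stage
// fetches from the dead host's stale locations, fails with FetchFailed,
// and burns one of the stage's retry attempts (4 in this report).
public class StageRetrySketch {
    static final int MAX_STAGE_ATTEMPTS = 4;

    // The lost blocks were never recomputed, so a retried task that still
    // points at the dead shuffle service can never succeed.
    static boolean attemptReduceStage() {
        return false; // always FetchFailed
    }

    public static void main(String[] args) {
        int attempts = 0;
        boolean succeeded = false;
        while (attempts < MAX_STAGE_ATTEMPTS && !succeeded) {
            succeeded = attemptReduceStage();
            attempts++;
            if (!succeeded) {
                System.out.println("attempt " + attempts + ": FetchFailed");
            }
        }
        if (!succeeded) {
            System.out.println("job failed after " + attempts + " stage attempts");
        }
    }
}
```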



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
