[ https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944312#comment-15944312 ]
Apache Spark commented on SPARK-20115:
--------------------------------------
User 'umehrot2' has created a pull request for this issue:
https://github.com/apache/spark/pull/17445
> Fix DAGScheduler to recompute all the lost shuffle blocks when external
> shuffle service is unavailable
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-20115
> URL: https://issues.apache.org/jira/browse/SPARK-20115
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core, YARN
> Affects Versions: 2.0.2, 2.1.0
> Environment: Spark on Yarn with external shuffle service enabled,
> running on AWS EMR cluster.
> Reporter: Udit Mehrotra
>
> Spark's DAGScheduler currently does not recompute all the lost shuffle
> blocks on a host when a FetchFailed exception occurs while fetching shuffle
> blocks from another executor with the external shuffle service enabled.
> Instead, it recomputes only the lost shuffle blocks produced by the executor
> for which the FetchFailed exception occurred. This works fine for the
> internal shuffle scenario, where executors serve their own shuffle blocks
> and hence only that executor's blocks should be considered lost. However,
> when the External Shuffle Service is in use, a FetchFailed exception means
> that the shuffle service running on that host has become unavailable, which
> is sufficient to assume that all the shuffle blocks managed by the service
> on that host are lost. Recomputing only the shuffle blocks associated with
> the particular executor that triggered the FetchFailed exception is
> therefore not enough: we need to recompute all the shuffle blocks managed
> by that service, because multiple executors could be running on that host.
>
> Since not all the shuffle blocks (for all the executors on the host) are
> recomputed, subsequent attempts of the reduce stage fail as well: the newly
> scheduled tasks keep trying to fetch from the old location of the shuffle
> blocks that were not recomputed and keep throwing further FetchFailed
> exceptions. This ultimately causes the job to fail once the reduce stage
> has been retried 4 times.
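The host-level invalidation the report calls for can be sketched in Scala. This is a minimal, hypothetical model of map-output tracking, not Spark's actual DAGScheduler internals: the class and method names (OutputTracker, removeOutputsOnExecutor, removeOutputsOnHost) are illustrative. It contrasts the two behaviors described above: executor-level removal (correct for internal shuffle) versus host-level removal (needed when the external shuffle service on a host dies).

```scala
// Hypothetical sketch of map-output bookkeeping; names are illustrative
// and do not match Spark's real internals.
case class MapStatus(host: String, execId: String, mapId: Int)

class OutputTracker {
  private var outputs = List.empty[MapStatus]

  def register(s: MapStatus): Unit = outputs = s :: outputs

  // Internal shuffle: only the failed executor's blocks are lost,
  // since each executor serves its own shuffle blocks.
  def removeOutputsOnExecutor(host: String, execId: String): Unit =
    outputs = outputs.filterNot(s => s.host == host && s.execId == execId)

  // External shuffle service: a FetchFailed means the service on the
  // whole host is unavailable, so every executor's blocks on that host
  // must be marked lost and recomputed.
  def removeOutputsOnHost(host: String): Unit =
    outputs = outputs.filterNot(_.host == host)

  def remaining: List[MapStatus] = outputs
}

object Demo extends App {
  val tracker = new OutputTracker
  tracker.register(MapStatus("hostA", "exec-1", 0))
  tracker.register(MapStatus("hostA", "exec-2", 1))
  tracker.register(MapStatus("hostB", "exec-3", 2))

  // FetchFailed from exec-1 on hostA with external shuffle service enabled:
  // invalidate everything on hostA, including exec-2's blocks.
  tracker.removeOutputsOnHost("hostA")
  println(tracker.remaining.map(_.mapId)) // only hostB's output survives
}
```

Had the sketch used removeOutputsOnExecutor("hostA", "exec-1") instead, exec-2's blocks on hostA would have stayed registered, and reduce tasks would keep fetching from the dead shuffle service — the retry loop the report describes.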
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]