[ https://issues.apache.org/jira/browse/SPARK-20230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957477#comment-15957477 ]
Apache Spark commented on SPARK-20230:
--------------------------------------
User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/17543
> FetchFailedExceptions should invalidate file caches in MapOutputTracker even
> if newer stages are launched
> ---------------------------------------------------------------------------------------------------------
>
> Key: SPARK-20230
> URL: https://issues.apache.org/jira/browse/SPARK-20230
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Burak Yavuz
>
> If you lose instances that have shuffle outputs, you will start observing
> messages like:
> {code}
> 17/03/24 11:49:23 WARN TaskSetManager: Lost task 0.0 in stage 64.1 (TID 3849,
> 172.128.196.240, executor 0): FetchFailed(BlockManagerId(4, 172.128.200.157,
> 4048, None), shuffleId=16, mapId=2, reduceId=3, message=
> {code}
> Generally, these messages are followed by:
> {code}
> 17/03/24 11:49:23 INFO DAGScheduler: Executor lost: 4 (epoch 20)
> 17/03/24 11:49:23 INFO BlockManagerMasterEndpoint: Trying to remove executor
> 4 from BlockManagerMaster.
> 17/03/24 11:49:23 INFO BlockManagerMaster: Removed 4 successfully in
> removeExecutor
> 17/03/24 11:49:23 INFO DAGScheduler: Shuffle files lost for executor: 4
> (epoch 20)
> 17/03/24 11:49:23 INFO ShuffleMapStage: ShuffleMapStage 63 is now unavailable
> on executor 4 (73/89, false)
> {code}
> which is great. Spark resubmits tasks for data that has been lost. However,
> if you have cascading instance failures, then you may come across:
> {code}
> 17/03/24 11:48:39 INFO DAGScheduler: Ignoring fetch failure from
> ResultTask(64, 46) as it's from ResultStage 64 attempt 0 and there is a more
> recent attempt for that stage (attempt ID 1) running
> {code}
> in which case the lost file outputs are not invalidated. On later retries of the
> stage, Spark will keep trying to fetch files from machines that no longer exist,
> and after 4 attempts it will give up. Had the fetch failure not been ignored, and
> the cached output locations invalidated, most of the lost files could have been
> recomputed during one of the earlier retries.
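> Below is a minimal, self-contained sketch in plain Scala of the intended behavior
> (this is a toy model, not Spark's DAGScheduler code and not the patch in
> https://github.com/apache/spark/pull/17543; all class and method names here are
> illustrative only): even when a fetch failure is ignored for resubmission purposes
> because a newer stage attempt is already running, the lost map output location
> should still be unregistered so later attempts stop fetching from a dead executor.
> {code}
> // Toy model (not Spark internals) of the SPARK-20230 scheduling decision.
> case class BlockManagerId(executorId: String, host: String)
>
> // Tracks which executor holds the output of each map task of a shuffle.
> class MapOutputTracker {
>   private val outputs =
>     scala.collection.mutable.Map.empty[(Int, Int), BlockManagerId] // (shuffleId, mapId) -> location
>
>   def registerMapOutput(shuffleId: Int, mapId: Int, loc: BlockManagerId): Unit =
>     outputs((shuffleId, mapId)) = loc
>
>   def unregisterMapOutput(shuffleId: Int, mapId: Int, loc: BlockManagerId): Unit =
>     if (outputs.get((shuffleId, mapId)).contains(loc)) outputs.remove((shuffleId, mapId))
>
>   def getLocation(shuffleId: Int, mapId: Int): Option[BlockManagerId] =
>     outputs.get((shuffleId, mapId))
> }
>
> case class FetchFailed(bmAddress: BlockManagerId, shuffleId: Int, mapId: Int,
>                        stageId: Int, stageAttemptId: Int)
>
> class Scheduler(tracker: MapOutputTracker) {
>   // Latest attempt id per stage; a failure reported by an older attempt is "stale".
>   private val latestAttempt = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
>
>   def startNewAttempt(stageId: Int): Unit = latestAttempt(stageId) += 1
>
>   def handleFetchFailed(f: FetchFailed): Unit = {
>     if (f.stageAttemptId < latestAttempt(f.stageId)) {
>       println(s"Ignoring fetch failure from stage ${f.stageId} attempt ${f.stageAttemptId}: " +
>         s"a newer attempt (${latestAttempt(f.stageId)}) is running")
>     } else {
>       println(s"Marking stage ${f.stageId} as failed and resubmitting")
>     }
>     // The point of SPARK-20230: invalidate the lost output location unconditionally,
>     // not only in the else branch, so retries do not keep hitting a dead executor.
>     tracker.unregisterMapOutput(f.shuffleId, f.mapId, f.bmAddress)
>   }
> }
>
> object Demo extends App {
>   val tracker  = new MapOutputTracker()
>   val lostExec = BlockManagerId("4", "172.128.200.157")
>   tracker.registerMapOutput(shuffleId = 16, mapId = 2, loc = lostExec)
>
>   val scheduler = new Scheduler(tracker)
>   scheduler.startNewAttempt(stageId = 64)                 // attempt 1 already running
>   scheduler.handleFetchFailed(FetchFailed(lostExec, shuffleId = 16, mapId = 2,
>     stageId = 64, stageAttemptId = 0))                    // stale failure from attempt 0
>
>   // The stale location is gone, so a retry recomputes the map output
>   // instead of repeatedly failing against the dead executor.
>   assert(tracker.getLocation(16, 2).isEmpty)
> }
> {code}
> With the invalidation pulled out of the non-stale branch, a retry of the stage sees
> the missing output and recomputes it instead of repeating the doomed fetch.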