[ https://issues.apache.org/jira/browse/SPARK-20230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20230:
------------------------------------

    Assignee:     (was: Apache Spark)

> FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20230
>                 URL: https://issues.apache.org/jira/browse/SPARK-20230
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Burak Yavuz
>
> If you lose instances that have shuffle outputs, you will start observing 
> messages like:
> {code}
> 17/03/24 11:49:23 WARN TaskSetManager: Lost task 0.0 in stage 64.1 (TID 3849, 172.128.196.240, executor 0): FetchFailed(BlockManagerId(4, 172.128.200.157, 4048, None), shuffleId=16, mapId=2, reduceId=3, message=
> {code}
> Generally, these messages are followed by:
> {code}
> 17/03/24 11:49:23 INFO DAGScheduler: Executor lost: 4 (epoch 20)
> 17/03/24 11:49:23 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
> 17/03/24 11:49:23 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
> 17/03/24 11:49:23 INFO DAGScheduler: Shuffle files lost for executor: 4 (epoch 20)
> 17/03/24 11:49:23 INFO ShuffleMapStage: ShuffleMapStage 63 is now unavailable on executor 4 (73/89, false)
> {code}
> which is great: Spark resubmits tasks for the data that was lost. However, if
> you have cascading instance failures, you may come across:
> {code}
> 17/03/24 11:48:39 INFO DAGScheduler: Ignoring fetch failure from ResultTask(64, 46) as it's from ResultStage 64 attempt 0 and there is a more recent attempt for that stage (attempt ID 1) running
> {code}
> which does not invalidate the cached map output locations. In later retries of
> the stage, Spark will attempt to fetch shuffle files from machines that no
> longer exist, and after 4 tries Spark will give up. If it had not ignored the
> fetch failure and had invalidated the cache, most of the lost files could have
> been recomputed during one of the earlier retries.
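> Below is a minimal sketch of the direction a fix could take, assuming the
> rough shape of DAGScheduler.handleTaskCompletion in Spark 2.1 (names
> approximate to that codebase; this is not a committed patch): hoist the map
> output invalidation above the attempt-staleness check, so that even an
> "ignored" fetch failure still invalidates the MapOutputTracker cache.
> {code}
> case FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) =>
>   val failedStage = stageIdToStage(task.stageId)
>   val mapStage = shuffleIdToMapStage(shuffleId)
>
>   // Invalidate the lost output unconditionally, so that later stage
>   // attempts stop trying to fetch from a machine that is gone.
>   if (mapId != -1) {
>     mapStage.removeOutputLoc(mapId, bmAddress)
>     mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
>   }
>
>   if (failedStage.latestInfo.attemptId != task.stageAttemptId) {
>     // Stale attempt: still skip the failed-stage resubmission bookkeeping,
>     // but the cache above has already been invalidated.
>     logInfo(s"Ignoring fetch failure from $task as it's from $failedStage " +
>       s"attempt ${task.stageAttemptId} and there is a more recent attempt " +
>       s"for that stage (attempt ID ${failedStage.latestInfo.attemptId}) running")
>   } else {
>     // Existing handling: mark failedStage and mapStage for resubmission
>     // and treat the executor as lost, as before.
>   }
> {code}
> Whether the executor-loss handling should also run for stale attempts is a
> separate question; the sketch only hoists the cache invalidation argued for
> above.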


