Eric Liang created SPARK-17371:
----------------------------------

             Summary: Resubmitted stage outputs deleted by zombie map tasks on stop()
                 Key: SPARK-17371
                 URL: https://issues.apache.org/jira/browse/SPARK-17371
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Eric Liang


It seems that old shuffle map tasks still running after a stage resubmit 
(zombie tasks) will delete the resubmitted stage's shuffle output files on 
stop(), causing downstream stages to fail even after the resubmit completes 
successfully. This can happen easily if the prior map task is waiting for a 
network timeout when its stage is resubmitted.
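
A toy model of the race may make the sequence easier to follow. This is 
not Spark code: ToyMapTask, the file name pattern, and simulate() are 
hypothetical, and the only behavior modeled is the one described above, 
namely that a failed attempt deletes the output for its map id on stop() 
without checking which attempt wrote it.

```python
import os
import tempfile


class ToyMapTask:
    """Hypothetical stand-in for one attempt of a shuffle map task."""

    def __init__(self, shuffle_dir, map_id, attempt):
        self.attempt = attempt
        # Assumption: every attempt of the same map task targets the same file.
        self.index_path = os.path.join(shuffle_dir, f"shuffle_0_{map_id}_0.index")

    def commit(self):
        with open(self.index_path, "w") as f:
            f.write(f"written by attempt {self.attempt}\n")

    def stop(self, success):
        # Modeled behavior: a failed attempt removes the output for its map id,
        # regardless of which attempt actually wrote it.
        if not success and os.path.exists(self.index_path):
            os.remove(self.index_path)


def simulate():
    with tempfile.TemporaryDirectory() as shuffle_dir:
        zombie = ToyMapTask(shuffle_dir, map_id=7, attempt=0)  # stuck on a timeout
        retry = ToyMapTask(shuffle_dir, map_id=7, attempt=1)   # resubmitted stage

        retry.commit()
        retry.stop(success=True)
        committed = os.path.exists(retry.index_path)

        zombie.stop(success=False)  # the timeout finally fires
        survives = os.path.exists(retry.index_path)
        return committed, survives
```

In this model simulate() reports that the retry's output existed after its 
commit and was gone once the zombie stopped, which is the state a downstream 
fetch would then run into.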

This can cause unnecessary stage resubmits, sometimes multiple times, and very 
confusing FetchFailure messages that report shuffle index files missing from 
the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems 
unnecessary to ever delete committed task output: even in the rare case that a 
task is failed after it finishes committing shuffle output, it should be safe 
to retain that output.
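
A minimal sketch of the proposed behavior, under stated assumptions: 
SaferToyMapTask and simulate_safe() are hypothetical names, and the 
write-to-temp-then-rename commit is only a generic stand-in for an atomic 
commit, not a description of how IndexShuffleBlockResolver is implemented. 
The point illustrated is that stop() cleans up only the attempt's own 
uncommitted file and never touches committed output.

```python
import os
import tempfile


class SaferToyMapTask:
    """Hypothetical map task attempt that never deletes committed output."""

    def __init__(self, shuffle_dir, map_id, attempt):
        self.attempt = attempt
        self.final_path = os.path.join(shuffle_dir, f"shuffle_0_{map_id}_0.index")
        # Each attempt writes to its own private temp file first.
        self.tmp_path = f"{self.final_path}.{attempt}.tmp"

    def write(self):
        with open(self.tmp_path, "w") as f:
            f.write(f"written by attempt {self.attempt}\n")

    def commit(self):
        # os.replace renames in one step, so readers see either the old
        # file or the new one, never a partial write.
        os.replace(self.tmp_path, self.final_path)

    def stop(self, success):
        # Only this attempt's uncommitted temp file is removed; committed
        # output is retained whether or not the attempt succeeded.
        if os.path.exists(self.tmp_path):
            os.remove(self.tmp_path)


def simulate_safe():
    with tempfile.TemporaryDirectory() as shuffle_dir:
        zombie = SaferToyMapTask(shuffle_dir, map_id=7, attempt=0)
        retry = SaferToyMapTask(shuffle_dir, map_id=7, attempt=1)

        zombie.write()
        retry.write()
        retry.commit()
        retry.stop(success=True)
        zombie.stop(success=False)  # late failure of the old attempt

        return os.path.exists(retry.final_path), sorted(os.listdir(shuffle_dir))
```

Here the zombie's late stop() removes only its own temp file, so the 
committed index file is the single entry left in the directory.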



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
