Eric Liang created SPARK-17371:
----------------------------------
Summary: Resubmitted stage outputs deleted by zombie map tasks on stop()
Key: SPARK-17371
URL: https://issues.apache.org/jira/browse/SPARK-17371
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Eric Liang
It seems that old shuffle map tasks still running after a stage resubmit will
delete the resubmitted stage's shuffle output files on stop(), causing
downstream stages to fail even after the resubmit completes successfully. This
can happen easily if the prior map task is waiting on a network timeout when
its stage is resubmitted.
The result is unnecessary stage resubmits, sometimes several in a row, and very
confusing FetchFailure messages that report shuffle index files missing from
the local disk.
Given that IndexShuffleBlockResolver commits data atomically, it seems
unnecessary to ever delete committed task output: even in the rare case that a
task is marked as failed after it has finished committing its shuffle output,
it should be safe to retain that output.
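
The race described above can be illustrated with a small toy model. This is a
sketch in Python, not Spark code: the class ToyMapTask, its two stop variants,
the function run_scenario, and the file naming are all hypothetical and exist
only for illustration. It assumes a commit is an atomic rename of a per-attempt
temp file onto a shared final path, and that the problematic stop() removes the
final path as well as the attempt's own temp file.

```python
import os
import tempfile


class ToyMapTask:
    """Toy model of one attempt of a shuffle map task (hypothetical, not Spark code)."""

    def __init__(self, shuffle_dir, map_id, attempt):
        # All attempts of the same map task share one final output path.
        self.final_path = os.path.join(shuffle_dir, f"shuffle_0_{map_id}_0.index")
        # Each attempt writes to its own temp file before committing.
        self.tmp_path = f"{self.final_path}.{attempt}.tmp"

    def write(self, data):
        with open(self.tmp_path, "w") as f:
            f.write(data)

    def commit(self):
        # Atomic rename: readers see either the old file or the new one.
        os.replace(self.tmp_path, self.final_path)

    def stop_deleting_output(self):
        # Assumed problematic behavior: an unsuccessful attempt removes the
        # final output path too, even if another attempt committed it.
        for path in (self.tmp_path, self.final_path):
            if os.path.exists(path):
                os.remove(path)

    def stop_keeping_committed(self):
        # Suggested behavior: clean up only this attempt's temp file.
        if os.path.exists(self.tmp_path):
            os.remove(self.tmp_path)


def run_scenario(safe_stop):
    """Return True if the committed output survives the zombie's stop."""
    with tempfile.TemporaryDirectory() as shuffle_dir:
        # The original attempt is stuck (e.g. on a network timeout).
        zombie = ToyMapTask(shuffle_dir, map_id=0, attempt=0)
        zombie.write("stale")

        # The stage is resubmitted; the new attempt finishes and commits.
        retry = ToyMapTask(shuffle_dir, map_id=0, attempt=1)
        retry.write("offsets")
        retry.commit()

        # The zombie finally gives up and is stopped.
        if safe_stop:
            zombie.stop_keeping_committed()
        else:
            zombie.stop_deleting_output()

        return os.path.exists(retry.final_path)
```

In this model, run_scenario(False) loses the committed file, which is the
situation a downstream fetch would report as a missing index file, while
run_scenario(True) keeps it. Because the commit is a single atomic rename, the
final path only ever holds fully written output, so retaining it is safe in
the toy model.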