Eric Liang created SPARK-17370:
----------------------------------
Summary: Shuffle service files not invalidated when a slave is lost
Key: SPARK-17370
URL: https://issues.apache.org/jira/browse/SPARK-17370
Project: Spark
Issue Type: Bug
Reporter: Eric Liang
DAGScheduler invalidates shuffle files when an executor loss event occurs, but
not when the external shuffle service is enabled, because with the shuffle
service on, shuffle files can outlive the executor that wrote them.
However, DAGScheduler also does not invalidate shuffle files when the shuffle
service itself is lost (for example, because the whole slave is lost). This can
cause long hangs when slaves are lost, since the missing files are not detected
until a subsequent stage attempts to read them.
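The invalidation rule the description asks for can be modeled as a small decision: a lost host invalidates every shuffle output stored on it, while a lost executor invalidates only its own outputs, and only when no external shuffle service is serving them. The sketch below is a hypothetical, simplified Python model, not Spark's DAGScheduler or MapOutputTracker code. `ShuffleOutputRegistry`, `register`, `on_executor_lost` and the `host_lost` flag are illustrative names, and the model assumes the scheduler can tell whether the whole host went away.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# (shuffle_id, map_id) identifies one map task's shuffle output.
OutputKey = Tuple[int, int]
# (executor_id, host) records who wrote the output and where it lives.
Location = Tuple[str, str]


@dataclass
class ShuffleOutputRegistry:
    """Hypothetical, simplified model of the scheduler's map-output bookkeeping.

    Not Spark's real MapOutputTracker/DAGScheduler API.
    """
    external_shuffle_service: bool
    outputs: Dict[OutputKey, Location] = field(default_factory=dict)

    def register(self, shuffle_id: int, map_id: int, executor_id: str, host: str) -> None:
        self.outputs[(shuffle_id, map_id)] = (executor_id, host)

    def _drop(self, lost_if: Callable[[Location], bool]) -> List[OutputKey]:
        # Collect first, then delete, so the dict is not mutated while iterating.
        lost = [key for key, loc in self.outputs.items() if lost_if(loc)]
        for key in lost:
            del self.outputs[key]
        return sorted(lost)

    def on_executor_lost(self, executor_id: str, host: str, host_lost: bool) -> List[OutputKey]:
        """Invalidate unreadable outputs; return the keys that must be recomputed."""
        if host_lost:
            # Whole slave gone: the shuffle service on it is gone too, so every
            # output on that host is unreadable (the case this issue reports).
            return self._drop(lambda loc: loc[1] == host)
        if not self.external_shuffle_service:
            # No shuffle service: the executor served its own files, which die with it.
            return self._drop(lambda loc: loc[0] == executor_id)
        # Executor gone, but the host's shuffle service still serves its files.
        return []


reg = ShuffleOutputRegistry(external_shuffle_service=True)
reg.register(0, 0, "exec-1", "host-a")
reg.register(0, 1, "exec-2", "host-a")
reg.register(0, 2, "exec-3", "host-b")
print(reg.on_executor_lost("exec-1", "host-a", host_lost=False))  # []
print(reg.on_executor_lost("exec-1", "host-a", host_lost=True))   # [(0, 0), (0, 1)]
```

Returning the invalidated keys eagerly is the point of the design: the scheduler can resubmit the affected map tasks right away instead of discovering the loss only when a later stage fails to fetch the files.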
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)