[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15457168#comment-15457168 ]
Apache Spark commented on SPARK-17370:
--------------------------------------
User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14931
> Shuffle service files not invalidated when a slave is lost
> ----------------------------------------------------------
>
> Key: SPARK-17370
> URL: https://issues.apache.org/jira/browse/SPARK-17370
> Project: Spark
> Issue Type: Bug
> Reporter: Eric Liang
>
> DAGScheduler invalidates shuffle files when an executor loss event occurs,
> unless the external shuffle service is enabled. With the shuffle service
> on, shuffle files can outlive the executor that wrote them, so they are
> kept.
> However, DAGScheduler also fails to invalidate shuffle files when the
> shuffle service itself is lost because the whole slave was lost. This can
> cause long hangs after a slave is lost, since the missing files are not
> detected until a subsequent stage attempts to read them.
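
To make the distinction in the description concrete, here is a minimal sketch of the two loss events. It is not Spark's DAGScheduler code and not the change in the pull request; `ShuffleRegistry`, `on_executor_lost`, `on_host_lost`, and the host and executor names are all hypothetical, chosen only to illustrate the behavior the issue asks for.

```python
from dataclasses import dataclass, field


@dataclass
class ShuffleRegistry:
    """Toy stand-in for a scheduler's record of where shuffle outputs live.

    Hypothetical illustration only; this is not a Spark class.
    """
    external_shuffle_service: bool
    # Maps (shuffle_id, map_id) -> (host, executor_id).
    outputs: dict = field(default_factory=dict)

    def register(self, shuffle_id, map_id, host, executor_id):
        self.outputs[(shuffle_id, map_id)] = (host, executor_id)

    def on_executor_lost(self, executor_id):
        """The executor died; the host and its shuffle service may still be up."""
        if self.external_shuffle_service:
            # The shuffle service serves the files, so they outlive the executor.
            return []
        return self._drop(lambda host, ex: ex == executor_id)

    def on_host_lost(self, host):
        """The whole host died, taking the shuffle service and its files with it."""
        return self._drop(lambda h, ex: h == host)

    def _drop(self, is_lost):
        # Forget every output whose location matches, and report what was removed.
        removed = sorted(k for k, (h, ex) in self.outputs.items() if is_lost(h, ex))
        for key in removed:
            del self.outputs[key]
        return removed


reg = ShuffleRegistry(external_shuffle_service=True)
reg.register(0, 0, "host-a", "exec-1")
reg.register(0, 1, "host-b", "exec-2")
kept = reg.on_executor_lost("exec-1")   # nothing invalidated: service still serves files
dropped = reg.on_host_lost("host-a")    # outputs on host-a are invalidated immediately
```

In this sketch, invalidating on host loss lets the scheduler mark the affected map outputs as missing right away, instead of discovering the loss only when a later stage fails to fetch them.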