[jira] [Commented] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

Imran Rashid (JIRA) Wed, 02 Aug 2017 12:13:08 -0700

    [ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111548#comment-16111548 ]


Imran Rashid commented on SPARK-13669:
--------------------------------------

Since I dragged my feet about the usefulness of this feature during the review, 
I thought in fairness I should mention that I have seen users run into the 
scenario where this patch would have been helpful.

The cases I've seen have come from incorrect cluster maintenance.  A disk goes 
bad in the cluster, and the node gets blacklisted by Spark; at some point later, 
someone attempts to fix the cluster, but instead of decommissioning the node, 
they just stop the NM on it.  The executors stay up, and even if they were 
blacklisted, once the blacklist expires they try to run tasks again.  Even with 
one bad disk, some of those tasks succeed by chance, but then you get fetch 
failures because the external shuffle service on that node is entirely down.

I still have qualms about this feature, though, because of the false positives 
you might get from it, so I'm recommending users leave it off anyway.  In 
particular, I worry that NM restarts will trigger blacklisting.  Even if that 
should be covered by timeouts, users may misconfigure things, or they may 
intentionally set small timeouts for streaming etc.  I could even see this 
resulting in the entire cluster being blacklisted.
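
For reference, a minimal sketch of the settings in play here.  The key names 
are the ones listed in the Spark 2.3 configuration docs as far as can be told 
from them; the values are illustrative only, and to_submit_args is a 
hypothetical helper, not part of Spark.

```python
# Blacklist-related settings discussed in this comment, as plain key/value pairs.
# Values are illustrative, not recommendations.
blacklist_conf = {
    "spark.shuffle.service.enabled": "true",
    "spark.blacklist.enabled": "true",
    # How long a blacklisted executor/node stays excluded before it is retried.
    "spark.blacklist.timeout": "1h",
    # The fetch-failure blacklisting from this issue; left off, as recommended above.
    "spark.blacklist.application.fetchFailure.enabled": "false",
}


def to_submit_args(conf):
    """Render a conf dict as spark-submit --conf arguments (hypothetical helper)."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(blacklist_conf)))
```

A short spark.blacklist.timeout is the kind of setting that could let a node 
with a stopped NM come back into rotation quickly, which is the scenario 
described above.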

I'm still missing data points on what happens when people use this in practice: 
whether it ends up producing false positives, or helps prevent other issues.  
I'd be interested in hearing more feedback on others' experience with it.

> Job will always fail in the external shuffle service unavailable situation
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13669
>                 URL: https://issues.apache.org/jira/browse/SPARK-13669
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core, YARN
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>             Fix For: 2.3.0
>
>
> Currently we are running into an issue with YARN work-preserving restart 
> enabled + external shuffle service. 
> In the work-preserving scenario, the failure of the NM does not cause 
> executors to exit, so executors can still accept and run tasks. The problem 
> is that when the NM has failed, the external shuffle service is 
> inaccessible, so reduce tasks always fail with “Fetch failure”, and the 
> failure of the reduce stage makes the parent stage (map stage) rerun. 
> The tricky part is that the Spark scheduler is not aware that the external 
> shuffle service is unavailable, so it reschedules the map tasks on the 
> executor whose NM has failed; the reduce stage again fails with 
> “Fetch failure”, and after 4 retries the job fails.
> So the main problem is that we should avoid assigning tasks to those bad 
> executors (where the shuffle service is unavailable). Spark's current 
> blacklist mechanism can blacklist executors/nodes based on failed tasks, but 
> it doesn't handle this specific fetch failure scenario. So the proposal is to 
> improve the current application blacklist mechanism to handle fetch failures 
> (especially when the external shuffle service is unavailable), by 
> blacklisting the executors/nodes where shuffle fetch is unavailable.
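
The failure loop in the description can be sketched with a toy model.  This is 
not Spark's scheduler: it assumes round-robin task placement, treats any map 
output on a node with a dead shuffle service as a fetch failure, and reruns the 
whole map stage on each attempt.  All names (run_job, MAX_STAGE_ATTEMPTS, the 
node names) are hypothetical.

```python
MAX_STAGE_ATTEMPTS = 4  # the "4 retries" mentioned in the description


def run_job(nodes, shuffle_down, num_map_tasks, blacklist_on_fetch_failure):
    """Toy model of the map/reduce retry loop; returns (succeeded, attempts_used).

    nodes: names of nodes that still have live executors
    shuffle_down: set of nodes whose external shuffle service is unreachable
    """
    blacklisted = set()
    for attempt in range(1, MAX_STAGE_ATTEMPTS + 1):
        usable = [n for n in nodes if n not in blacklisted]
        if not usable:
            # Every node has been blacklisted; nothing left to schedule on.
            return False, attempt
        # Map stage: spread tasks round-robin over the usable nodes.
        placement = [usable[i % len(usable)] for i in range(num_map_tasks)]
        # Reduce stage: each map output is fetched through its node's shuffle service.
        failed_nodes = {n for n in placement if n in shuffle_down}
        if not failed_nodes:
            return True, attempt
        if blacklist_on_fetch_failure:
            # The proposed behavior: stop scheduling on nodes that caused fetch failures.
            blacklisted |= failed_nodes
    return False, MAX_STAGE_ATTEMPTS


nodes = ["node1", "node2", "node3"]
print(run_job(nodes, {"node2"}, 6, False))  # → (False, 4)
print(run_job(nodes, {"node2"}, 6, True))   # → (True, 2)
```

Without blacklisting, map tasks keep landing on node2 and every attempt hits 
the same fetch failure; with it, the second attempt avoids node2 and succeeds.  
The same model also shows the downside: if every node's shuffle service is 
briefly unreachable, all nodes end up blacklisted and the job cannot proceed.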




