[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

Venkata krishnan Sowrirajan (Jira) Mon, 13 Apr 2020 16:16:50 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082731#comment-17082731
 ]


Venkata krishnan Sowrirajan commented on SPARK-31418:
-----------------------------------------------------

[~tgraves] My current plan is to check whether dynamic allocation is 
enabled and, if so, request one more executor through 
ExecutorAllocationClient#requestExecutors and start the abort timer. However, 
I re-read your 
[comment|https://issues.apache.org/jira/browse/SPARK-22148?focusedCommentId=17078278&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17078278]
 and it seems you tried to pass the information to 
ExecutorAllocationManager and request the executor through it. Is that right?
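
To make the first option concrete, here is a minimal sketch of the decision 
being discussed. It is not Spark code: AllocationClient is a hypothetical 
stand-in that is assumed to mirror the single requestExecutors method of 
ExecutorAllocationClient, and the Decision names and the decide function are 
made up for illustration. It also encodes only one reading of the proposal, 
in which the dynamic allocation check comes first and the existing behaviour 
is the fallback.

```scala
object UnschedulableTaskSetSketch {
  // Hypothetical stand-in, assumed to mirror
  // ExecutorAllocationClient#requestExecutors(numAdditionalExecutors: Int): Boolean.
  trait AllocationClient {
    def requestExecutors(numAdditionalExecutors: Int): Boolean
  }

  // Illustrative outcomes only; these names do not exist in Spark.
  sealed trait Decision
  case object RequestExecutorAndStartTimer extends Decision
  case object KillIdleExecutorAndStartTimer extends Decision
  case object AbortImmediately extends Decision

  // Decide what to do when a task cannot be scheduled on any current executor.
  def decide(
      dynamicAllocationEnabled: Boolean,
      hasIdleBlacklistedExecutor: Boolean,
      client: AllocationClient): Decision = {
    if (dynamicAllocationEnabled && client.requestExecutors(1)) {
      // Proposed path: ask for one more executor, then let the abort timer run.
      RequestExecutorAndStartTimer
    } else if (hasIdleBlacklistedExecutor) {
      // Existing path: kill an idle blacklisted executor so it can be replaced.
      KillIdleExecutorAndStartTimer
    } else {
      // Existing path: nothing to kill and nothing requested, so abort.
      AbortImmediately
    }
  }
}
```

With dynamic allocation disabled, or if the request is rejected, the sketch 
falls through to the existing behaviour, so only the dynamic allocation case 
changes.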

Regarding the idea of killing another non-idle blacklisted executor, I don't 
think that would be better, since we might kill tasks from other stages, as 
mentioned in other comments on the PR. Let me know if you have any other 
thoughts on this problem. We are hitting this issue frequently, although 
retrying the whole job does pass.

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31418
>                 URL: https://issues.apache.org/jira/browse/SPARK-31418
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0, 2.4.5
>            Reporter: Venkata krishnan Sowrirajan
>            Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. To retry the task, Spark checks whether there is 
> an idle blacklisted executor that can be killed and replaced; if there is 
> none, it aborts the job without retrying up to the maximum number of times.
> With dynamic allocation this can be handled better: instead of killing a 
> blacklisted idle executor (it's possible there is no idle blacklisted 
> executor), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below. The 
> example should fail eventually anyway; it only shows that the task is not 
> retried spark.task.maxFailures times: 
> {code:scala}
> // Casting an Int to String throws ClassCastException, so every task fails.
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> Run it with dynamic allocation enabled and the minimum number of executors 
> set to 1. There are various other cases where this can fail as well.
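
For anyone trying the reproduction, the settings it relies on might look 
roughly like the sketch below, which renders them as spark-shell --conf 
arguments. The description only states dynamic allocation and min executors 
set to 1; the spark.blacklist.enabled and spark.shuffle.service.enabled 
entries are assumptions about what a 2.x setup would also need, and the 
ReproConf object and toSparkShellArgs helper are hypothetical names.

```scala
object ReproConf {
  // Settings for the reproduction. The first two come from the description;
  // the last two are assumptions for a Spark 2.x cluster.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "1",
    "spark.shuffle.service.enabled" -> "true", // assumption: needed by dynamic allocation
    "spark.blacklist.enabled" -> "true"        // assumption: turns blacklisting on
  )

  // Render key/value pairs as command-line arguments for spark-shell.
  def toSparkShellArgs(conf: Seq[(String, String)]): String =
    conf.map { case (key, value) => s"--conf $key=$value" }.mkString(" ")
}

// Example: println("spark-shell " + ReproConf.toSparkShellArgs(ReproConf.settings))
```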



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

