[ 
https://issues.apache.org/jira/browse/SPARK-22213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890081#comment-16890081
 ] 

Yuri Ronin commented on SPARK-22213:
------------------------------------

[~hyukjin.kwon] thanks

> Spark to detect slow executors on nodes with problematic hardware
> -----------------------------------------------------------------
>
>                 Key: SPARK-22213
>                 URL: https://issues.apache.org/jira/browse/SPARK-22213
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.0.0
>         Environment: - AWS EMR clusters 
> - window time is 60s
> - several millions of events processed per minute
>            Reporter: Oleksandr Konopko
>            Priority: Major
>              Labels: bulk-closed
>
> Sometimes a newly created cluster contains 1-2 slow nodes. When the 
> average task finishes in 5 seconds, it can take up to 50 seconds to finish 
> on a slow node. As a result, batch processing time increases by 45s.
> To avoid that we could use the `speculation` feature, but it seems that 
> it can be improved:
>  
> - The 1st issue with `speculation` is that we do not want to use it on 
> all tasks, since we have tens of thousands of them during the processing of 
> one batch. Spawning several thousand extra tasks would not be 
> resource-efficient. I suggest creating a new parameter, 
> `spark.speculation.mintime`. This would specify the minimal task run time 
> for speculation to be enabled for a task.
> - The 2nd issue is that even if Spark spawns speculative tasks only for 
> long-running ones (longer than 10s, for example), the task on the slow node 
> will still run for a significant time before it is killed, which still makes 
> batch processing time longer than it should be. The solution is to enable 
> `blacklisting` for slow nodes. With speculation and blacklisting combined, 
> only the first 1-2 batches would take more time than expected. After the 
> faulty node is blacklisted, batch processing time is as expected.
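
The two proposals in the description above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Spark scheduler code. `min_time_s` stands in for the proposed `spark.speculation.mintime`, which is a suggestion in this ticket and not an existing Spark setting. The function names, the `slow_factor` threshold, and the default values are assumptions made for the example.

```python
from statistics import median


def should_speculate(elapsed_s, median_s, multiplier=1.5, min_time_s=10.0):
    """Decide whether a running task qualifies for a speculative copy.

    min_time_s models the proposed `spark.speculation.mintime`
    (hypothetical). multiplier plays the role of a "slower than the
    median by this factor" threshold; 1.5 is an illustrative value.
    """
    return elapsed_s >= min_time_s and elapsed_s > multiplier * median_s


def find_slow_nodes(durations_by_node, slow_factor=3.0):
    """Return the nodes whose median task duration exceeds
    slow_factor times the cluster-wide median (candidates for blacklisting).

    slow_factor is a hypothetical threshold chosen for this sketch.
    """
    all_durations = [d for ds in durations_by_node.values() for d in ds]
    if not all_durations:
        return set()
    cluster_median = median(all_durations)
    return {
        node
        for node, ds in durations_by_node.items()
        if ds and median(ds) > slow_factor * cluster_median
    }
```

With the numbers from the description (a typical task takes 5 seconds, a task on a slow node takes up to 50 seconds), `should_speculate(50, 5)` qualifies the slow task, while a task that has run for only 8 seconds is skipped because it is below the minimum run time. Feeding per-node durations from the first batch or two into `find_slow_nodes` gives the set of nodes that later batches would avoid.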



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

