Oleksandr Konopko created SPARK-22213:
-----------------------------------------
Summary: Spark to detect slow executors on nodes with problematic
hardware
Key: SPARK-22213
URL: https://issues.apache.org/jira/browse/SPARK-22213
Project: Spark
Issue Type: Improvement
Components: Scheduler
Affects Versions: 2.0.0
Environment: - AWS EMR clusters
- window time is 60s
- several millions of events processed per minute
Reporter: Oleksandr Konopko
Sometimes when a new cluster is created it contains 1-2 slow nodes. While an average
task finishes in 5 seconds, it can take up to 50 seconds to finish on a slow node.
As a result, batch processing time increases by 45s.
In order to avoid this we could use the `speculation` feature, but it seems that it
can be improved:
- The 1st issue with `speculation` is that we do not want to enable it for all
tasks, since there are tens of thousands of them during the processing of one
batch; spawning several thousand extra tasks would not be resource-efficient. I
suggest adding a new parameter, `spark.speculation.mintime`, which would specify
the minimum task run time before speculation is enabled for that task.
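A minimal sketch of how this could look in spark-defaults.conf. The first four keys are existing Spark speculation settings (shown with their defaults); `spark.speculation.mintime` is the new parameter proposed here and does not exist in Spark today:

```
# Existing speculation settings
spark.speculation               true
spark.speculation.interval      100ms
spark.speculation.multiplier    1.5
spark.speculation.quantile      0.75

# Proposed (hypothetical) parameter: only consider a task for a
# speculative copy after it has already run for at least this long
spark.speculation.mintime       10s
```

With such a threshold, the scheduler would skip speculation entirely for the tens of thousands of short tasks and spawn copies only for the few that exceed the configured run time.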
- The 2nd issue is that even if Spark spawns speculative copies only for
long-running tasks (longer than 10s, for example), the task on the slow node
still runs for a significant time before it is killed, which still makes batch
processing time longer than it should be. The solution is to enable `blacklisting`
for slow nodes. With speculation and blacklisting combined, only the first 1-2
batches would take more time than expected; after the faulty node is blacklisted,
batch processing time returns to normal.
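For reference, Spark's existing blacklist settings (available from Spark 2.1) look roughly like the sketch below. Note they trigger on task *failures*, not slowness, so blacklisting nodes because their tasks are merely slow, as proposed here, would be new behavior:

```
# Existing blacklist settings (failure-based), sketch with example values
spark.blacklist.enabled                          true
spark.blacklist.task.maxTaskAttemptsPerNode      1
spark.blacklist.stage.maxFailedTasksPerExecutor  2
spark.blacklist.timeout                          1h
```

One way to bridge the two mechanisms would be to count a task killed because its speculative copy won as a strike against the node it ran on, so a node that repeatedly loses speculation races eventually gets blacklisted.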
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)