GitHub user pgandhi999 opened a pull request:
https://github.com/apache/spark/pull/22221
[SPARK-25231] : Executor Heartbeat Receiver does not need to synchronize
on the TaskSchedulerImpl object
Running a large Spark job with speculation turned on caused executor
heartbeats to time out on the driver after some time. Eventually, after
hitting the maximum number of executor failures, the job would fail.
## What changes were proposed in this pull request?
The main reason for the heartbeat timeouts was that the
heartbeat-receiver-event-loop-thread was blocked waiting for the lock on the
TaskSchedulerImpl object, which was held by one of the
dispatcher-event-loop threads executing the method dequeueSpeculativeTasks() in
TaskSetManager.scala. On further analysis of the heartbeat receiver method, it
turns out there is no need to hold the lock on the whole object: the block of
code in the method only uses one shared HashMap, taskIdToTaskSetManager. Making
that map a ConcurrentHashMap keeps individual map operations atomic without the
object-wide lock, so the heartbeat receiver thread no longer has to wait for
the scheduler.
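
The idea can be illustrated with a minimal sketch. This is not the code from the patch: `SchedulerSketch`, `TaskSetManagerStub`, `withSchedulerLock` and `lookup` are hypothetical names made up for the illustration, and the real TaskSchedulerImpl does considerably more. Only the map name `taskIdToTaskSetManager` and the use of `java.util.concurrent.ConcurrentHashMap` come from the description above.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for Spark's TaskSetManager; only a name is modeled.
final case class TaskSetManagerStub(name: String)

// Hypothetical stand-in for TaskSchedulerImpl, reduced to the one map that
// the heartbeat path reads.
class SchedulerSketch {
  // A ConcurrentHashMap makes single-key reads and writes thread-safe on
  // their own, so readers do not need the scheduler-wide lock.
  val taskIdToTaskSetManager = new ConcurrentHashMap[Long, TaskSetManagerStub]()

  // Models scheduler work that still runs under the object lock
  // (for example a long speculative-task scan).
  def withSchedulerLock[T](body: => T): T = this.synchronized { body }

  // Heartbeat-side lookup: deliberately NOT wrapped in `synchronized`.
  def lookup(taskIds: Seq[Long]): Seq[(Long, TaskSetManagerStub)] =
    taskIds.flatMap { id =>
      // get returns null for a missing key; Option(...) maps that to None.
      Option(taskIdToTaskSetManager.get(id)).map(tsm => id -> tsm)
    }
}
```

Because `lookup` never takes the monitor on the scheduler object, a thread calling it is not blocked while another thread is inside `withSchedulerLock`. The trade-off is that only individual map operations are atomic; a sequence of several lookups is no longer a single consistent snapshot of the map.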
## How was this patch tested?
Screenshots of the thread dump have been attached below:
**heartbeat-receiver-event-loop-thread:**
<img width="1409" alt="screen shot 2018-08-24 at 9 19 57 am"
src="https://user-images.githubusercontent.com/22228190/44593413-e25df780-a788-11e8-9520-176a18401a59.png">
**dispatcher-event-loop-thread:**
<img width="1409" alt="screen shot 2018-08-24 at 9 21 56 am"
src="https://user-images.githubusercontent.com/22228190/44593484-13d6c300-a789-11e8-8d88-34b1d51d4541.png">
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/pgandhi999/spark SPARK-25231
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22221.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22221
----
commit a0dcde583c76cb96f5112f4ff863874415ec9140
Author: pgandhi <pgandhi@...>
Date: 2018-08-24T15:27:01Z
[SPARK-25231] : Executor Heartbeat Receiver does not need to synchronize on
the TaskSchedulerImpl object
The main reason for the heartbeat timeouts was that the
heartbeat-receiver-event-loop-thread was blocked waiting on the
TaskSchedulerImpl object which was being held by one of the
dispatcher-event-loop threads executing the method dequeueSpeculativeTasks() in
TaskSetManager.scala. On further analysis of the heartbeat receiver method, it
turns out there is no need to hold the lock on the whole object. The block of
code in the method only uses one global HashMap taskIdToTaskSetManager. Making
that map a ConcurrentHashMap, we are ensuring atomicity of operations and
speeding up the heartbeat receiver thread operation.
----