GitHub user pgandhi999 opened a pull request:

    https://github.com/apache/spark/pull/22221

    [SPARK-25231] : Executor Heartbeat Receiver does not need to synchronize on the TaskSchedulerImpl object
    
    Running a large Spark job with speculation turned on caused executor
    heartbeats to time out on the driver after some time. Eventually, after
    hitting the maximum number of executor failures, the job would fail.
    
    ## What changes were proposed in this pull request?
    
    The main reason for the heartbeat timeouts was that the
    heartbeat-receiver-event-loop-thread was blocked waiting for the lock on the
    TaskSchedulerImpl object, which was held by one of the
    dispatcher-event-loop threads executing the method dequeueSpeculativeTasks() in
    TaskSetManager.scala. On further analysis of the heartbeat receiver method, it
    turns out there is no need to hold the lock on the whole object: the block of
    code in the method only uses one global HashMap, taskIdToTaskSetManager. Making
    that map a ConcurrentHashMap keeps individual operations on it atomic and
    speeds up the heartbeat receiver thread, which no longer waits for the lock
    on the whole object.
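    
    The idea can be illustrated with a minimal sketch. This is not the actual
    patch: SchedulerSketch, TaskSetInfo, registerTask and stagesForHeartbeat are
    hypothetical names used only for illustration, and only
    taskIdToTaskSetManager comes from the description above. The sketch assumes
    the heartbeat path needs nothing but lookups in that one map.
    
    ```scala
    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical stand-in for TaskSetManager; holds only what the lookup needs.
    final case class TaskSetInfo(stageId: Int)

    // Hypothetical, simplified scheduler; not the real TaskSchedulerImpl.
    class SchedulerSketch {
      // Thread-safe map, so readers do not need the scheduler-wide lock.
      val taskIdToTaskSetManager = new ConcurrentHashMap[Long, TaskSetInfo]()

      // Other scheduling work may still take the object lock, possibly for long.
      def registerTask(taskId: Long, info: TaskSetInfo): Unit = synchronized {
        taskIdToTaskSetManager.put(taskId, info)
      }

      // Heartbeat path: no `synchronized`, only atomic map lookups.
      // Returns (taskId, stageId) for every task id that is still known.
      def stagesForHeartbeat(taskIds: Seq[Long]): Seq[(Long, Int)] =
        taskIds.flatMap { id =>
          // get() returns null for an unknown key; Option(...) maps that to None.
          Option(taskIdToTaskSetManager.get(id)).map(info => (id, info.stageId))
        }
    }
    ```
    
    In this sketch, a thread calling stagesForHeartbeat never blocks on the
    SchedulerSketch monitor, even while another thread is inside a synchronized
    method. Each get() is atomic on its own, but a sequence of lookups is not a
    consistent snapshot of the map, which is acceptable only if the caller
    tolerates task ids disappearing between calls.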
    
    ## How was this patch tested?
    
    Screenshots of the thread dump have been attached below:
    **heartbeat-receiver-event-loop-thread:**
    
    <img width="1409" alt="screen shot 2018-08-24 at 9 19 57 am" 
src="https://user-images.githubusercontent.com/22228190/44593413-e25df780-a788-11e8-9520-176a18401a59.png">
    
    **dispatcher-event-loop-thread:**
    
    <img width="1409" alt="screen shot 2018-08-24 at 9 21 56 am" 
src="https://user-images.githubusercontent.com/22228190/44593484-13d6c300-a789-11e8-8d88-34b1d51d4541.png">
    
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pgandhi999/spark SPARK-25231

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22221.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22221
    
----
commit a0dcde583c76cb96f5112f4ff863874415ec9140
Author: pgandhi <pgandhi@...>
Date:   2018-08-24T15:27:01Z

    [SPARK-25231] : Executor Heartbeat Receiver does not need to synchronize on 
the TaskSchedulerImpl object
    
    The main reason for the heartbeat timeouts was that the
    heartbeat-receiver-event-loop-thread was blocked waiting for the lock on the
    TaskSchedulerImpl object, which was held by one of the
    dispatcher-event-loop threads executing the method dequeueSpeculativeTasks() in
    TaskSetManager.scala. On further analysis of the heartbeat receiver method, it
    turns out there is no need to hold the lock on the whole object: the block of
    code in the method only uses one global HashMap, taskIdToTaskSetManager. Making
    that map a ConcurrentHashMap keeps individual operations on it atomic and
    speeds up the heartbeat receiver thread, which no longer waits for the lock
    on the whole object.

----

