Github user sitalkedia commented on the pull request:

    https://github.com/apache/spark/pull/11167#issuecomment-183006784
  
    As you can see from the jstack of the driver (http://pastebin.com/m8CP6VMv), 
the dag-scheduler-event-loop thread has taken a lock and is spending a lot of 
time in the addPendingTask function. For each task added, it iterates over the 
list of pending tasks to check for duplicates, which makes adding n tasks an 
O(n^2) operation; when the number of tasks is huge, it takes more than 5 minutes. 
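    
    To make the cost concrete, here is a minimal sketch of the pattern that 
shows up in the jstack. This is not the actual TaskSetManager code; 
`pendingTasks` and the method names are hypothetical stand-ins for the 
per-locality pending-task lists:

```scala
import scala.collection.mutable.ArrayBuffer

object AddPendingTaskSketch {
  // Hypothetical stand-in for one of the per-locality pending-task lists.
  val pendingTasks = new ArrayBuffer[Int]

  // Current behavior: the duplicate check scans the whole buffer, so
  // adding n tasks one at a time is O(n^2) overall -- all while the
  // scheduler lock is held.
  def addPendingTaskWithCheck(index: Int): Unit = {
    if (!pendingTasks.contains(index)) {
      pendingTasks += index
    }
  }

  // Proposed behavior: drop the scan so each insertion is O(1).
  // Duplicates are tolerated because the dequeue side filters out
  // entries that are already running or finished.
  def addPendingTaskNoCheck(index: Int): Unit = {
    pendingTasks += index
  }
}
```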
    
    As mentioned in the comment, the addPendingTask function does not really 
need to check for duplicates, because dequeueTaskFromList will skip tasks that 
are already running. If we remove the duplicate check from addPendingTask, the 
lock is held only for a very short time and things work fine. We cannot turn 
this list into a set because we treat the list of pending tasks as a stack; 
please see 
https://github.com/sitalkedia/spark/blob/fix_stuck_driver/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L113. 
A simplified sketch of the dequeue side is shown below.
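    
    For reference, here is a simplified sketch of the dequeue side, showing 
both the LIFO (stack-order) traversal and why a stale duplicate left behind by 
addPendingTask is harmless. The bookkeeping arrays are hypothetical stand-ins 
named after the real fields, not the actual method:

```scala
import scala.collection.mutable.ArrayBuffer

object DequeueSketch {
  val numTasks = 4
  // Hypothetical stand-ins for the TaskSetManager bookkeeping arrays.
  val copiesRunning = new Array[Int](numTasks)
  val successful = new Array[Boolean](numTasks)

  // Walk the pending list from the tail (stack order) and lazily
  // remove each entry we inspect; anything already running or
  // finished -- including a stale duplicate -- is simply skipped.
  def dequeueTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
    var offset = list.size
    while (offset > 0) {
      offset -= 1
      val index = list.remove(offset)
      if (copiesRunning(index) == 0 && !successful(index)) {
        return Some(index)
      }
    }
    None
  }
}
```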
    
    Please note that this is a regression from Spark 1.5, introduced in 
https://github.com/facebook/FB-Spark/commit/3535b91ddc9fd05b613a121e09263b0f378bd5fa#diff-bad3987c83bd22d46416d3dd9d208e76L789