Sital Kedia created SPARK-13279:
-----------------------------------

             Summary: Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage
                 Key: SPARK-13279
                 URL: https://issues.apache.org/jira/browse/SPARK-13279
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.0
            Reporter: Sital Kedia
             Fix For: 1.6.0

While running a large pipeline with 200k tasks, we found that the executors were unable to register with the driver because the driver was stuck holding a global lock in the TaskSchedulerImpl.submitTasks function.

jstack of the driver - http://pastebin.com/m8CP6VMv
executor log - http://pastebin.com/2NPS1mXC

From the jstack, I see that the thread handling the resource offer from executors (dispatcher-event-loop-9) is blocked on a lock held by the thread "dag-scheduler-event-loop", which iterates over an entire ArrayBuffer each time it adds a pending task. Each add can scan up to n entries, so adding n tasks costs O(n^2) in total. With 200k pending tasks, the driver is hung for more than 5 minutes.

Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which provides O(1) lookup and also maintains the ordering.
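
To illustrate the cost difference, here is a minimal sketch, not the actual Spark code. It uses Java's ArrayList and LinkedHashSet as stand-ins for the Scala ArrayBuffer and LinkedHashSet mentioned above, and it assumes the add path scans the buffer for duplicates before appending. The class and method names (PendingTasksSketch, addWithListScan, addWithLinkedHashSet) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from Spark.
class PendingTasksSketch {

    // List-backed pending queue: contains() scans the whole list,
    // so each add is O(n) and n adds are O(n^2) overall.
    static List<Integer> addWithListScan(int numTasks) {
        List<Integer> pending = new ArrayList<>();
        for (int index = 0; index < numTasks; index++) {
            if (!pending.contains(index)) {
                pending.add(index);
            }
        }
        return pending;
    }

    // LinkedHashSet-backed pending queue: add() does the duplicate check
    // in expected O(1) time and iteration keeps insertion order.
    static Set<Integer> addWithLinkedHashSet(int numTasks) {
        Set<Integer> pending = new LinkedHashSet<>();
        for (int index = 0; index < numTasks; index++) {
            pending.add(index);
        }
        return pending;
    }

    public static void main(String[] args) {
        int numTasks = 50_000; // smaller than 200k so the slow path finishes quickly

        long start = System.nanoTime();
        addWithListScan(numTasks);
        long listMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        addWithLinkedHashSet(numTasks);
        long setMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("list scan:     " + listMillis + " ms");
        System.out.println("LinkedHashSet: " + setMillis + " ms");
    }
}
```

Both methods produce the same task indexes in the same order; only the cost of the duplicate check differs. Running main with a larger numTasks should make the gap grow roughly quadratically for the list version, though actual timings depend on the JVM and hardware.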