[
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859175#comment-15859175
]
Apache Spark commented on SPARK-16929:
--------------------------------------
User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/16867
> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> ------------------------------------------------------------------------
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working, I looked
> into it and noticed that stderr was saying some tasks were taking almost an
> hour to run even though in the application logs on the nodes that task only
> took a minute or so to run. Digging into the thread dump for the master node
> I noticed a number of threads are blocked, apparently by speculation thread.
> At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while
> it looks through the tasks to see what needs to be rerun. Unfortunately that
> code loops through each of the tasks, so when you have even just a couple
> hundred thousand tasks to run that can be prohibitively slow to run inside of
> a synchronized block. Once I disabled speculation, the job went back to
> having acceptable performance.
> There are no comments around that lock indicating why it was added, and the
> git history seems to have a couple refactorings so its hard to find where it
> was added. I'm tempted to believe it is the result of someone assuming that
> an extra synchronized block never hurt anyone (in reality I've probably just
> as many bugs caused by over synchronization as too little) as it looks too
> broad to be actually guarding any potential concurrency issue. But, since
> concurrency issues can be tricky to reproduce (and yes, I understand that's
> an extreme understatement) I'm not sure just blindly removing it without
> being familiar with the history is necessarily safe.
> Can someone look into this? Or at least make a note in the documentation
> that speculation should not be used with large clusters?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]