[ https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated MAPREDUCE-1436:
-------------------------------------

    Attachment: mapreduce-1436-v2.patch

Here's a new patch that always locks the JobTracker before locking the FairScheduler in update(). This should resolve both of the deadlocks reported above.

I've also increased the default update interval from 0.5 seconds to 2.5 seconds in this patch. The only negative impact of this should be that preemption and speculation take slightly longer to kick in. These are really the only reasons we need to call update() other than when jobs are added and removed; speculative tasks are counted in updateDemand, and preemption is checked regularly in updatePreemptionVariables().

I've also thought a bit about the impact of coarser locking on the performance of the JobTracker, and I think it's actually fairly small. First of all, since assignTasks already locks the FairScheduler, we wouldn't gain much by locking only the FS in update() and not the JT, because the JT calls assignTasks on every heartbeat anyway. Second, I timed update() on a simulated cluster with 2500 nodes, 4 slots per node, 100 jobs and 20 pools, and one call to update() took about 50 ms. With the new default update interval of 2500 ms, only about 2% of the JobTracker's time (50 ms out of every 2500 ms) should be spent on this (and for such a large cluster, the update interval can be upped through the config file anyway).

> Deadlock in preemption code in fair scheduler
> ---------------------------------------------
>
>                 Key: MAPREDUCE-1436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/fair-share
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>            Priority: Blocker
>         Attachments: deadlock.png, mapreduce-1436-v2.patch, mapreduce-1436.patch
>
>
> In testing the fair scheduler with preemption, I found a deadlock between
> updatePreemptionVariables and some code in the JobTracker.
> This was found while testing a backport of the fair scheduler to Hadoop 0.20,
> but it looks like it could also happen in trunk and 0.21. Details are in a
> comment below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
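
For readers who want to see the lock-ordering rule from the comment above in isolation, here is a minimal sketch. It assumes two hypothetical classes, Tracker and Scheduler, standing in for the JobTracker and the FairScheduler; they are not the real Hadoop classes and this is not the content of the patch. Both the heartbeat path and update() take the tracker lock first and the scheduler lock second, so neither thread can hold one lock while waiting for the other in the opposite order.

```java
// Illustrative only: Tracker and Scheduler are hypothetical stand-ins,
// not the real Hadoop JobTracker / FairScheduler classes or the patch itself.
public class LockOrderSketch {

    static class Tracker {
        private final Scheduler scheduler = new Scheduler(this);
        private int heartbeats = 0;

        // Heartbeat path: holds the tracker lock, then takes the scheduler lock.
        synchronized void heartbeat() {
            heartbeats++;
            scheduler.assignTasks();
        }

        synchronized int heartbeats() {
            return heartbeats;
        }

        Scheduler scheduler() {
            return scheduler;
        }
    }

    static class Scheduler {
        private final Tracker tracker;
        private int assigned = 0;
        private int updates = 0;

        Scheduler(Tracker tracker) {
            this.tracker = tracker;
        }

        // Called with the tracker lock already held by heartbeat().
        synchronized void assignTasks() {
            assigned++;
        }

        // Periodic update: tracker lock first, then scheduler lock,
        // the same order the heartbeat path uses.
        void update() {
            synchronized (tracker) {
                synchronized (this) {
                    updates++;
                    tracker.heartbeats(); // reads tracker state; its lock is already held
                }
            }
        }
    }

    // Runs both paths concurrently and reports whether both threads
    // finished before the timeout (false would suggest they are stuck).
    public static boolean runConcurrently(int iterations, long timeoutMs) {
        Tracker tracker = new Tracker();
        Thread heartbeats = new Thread(() -> {
            for (int i = 0; i < iterations; i++) tracker.heartbeat();
        });
        Thread updates = new Thread(() -> {
            for (int i = 0; i < iterations; i++) tracker.scheduler().update();
        });
        heartbeats.setDaemon(true);
        updates.setDaemon(true);
        heartbeats.start();
        updates.start();
        try {
            heartbeats.join(timeoutMs);
            updates.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !heartbeats.isAlive() && !updates.isAlive();
    }

    public static void main(String[] args) {
        boolean finished = runConcurrently(100_000, 10_000);
        System.out.println(finished ? "both threads finished" : "threads still blocked");
    }
}
```

If update() in this sketch took the scheduler lock first and then called back into the tracker, the two threads could each end up holding one lock while waiting for the other, which is the general kind of inversion a consistent lock order rules out.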