[
https://issues.apache.org/jira/browse/HADOOP-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vivek Ratan updated HADOOP-4977:
--------------------------------
Attachment: 4977.1.patch
Attaching patch (4977.1.patch). As Amar points out, the problem is that the JT
locks itself, then calls the scheduler's assignTasks, which tries to lock one
of the scheduler's objects. In the meantime, a separate thread in the
scheduler locks this object, then calls a TaskTrackerManager method.
TaskTrackerManager is implemented by the JT, so the two threads acquire the
same two locks in opposite orders. Hence the deadlock.
The fix is for threads in the scheduler to call TaskTrackerManager first,
before locking anything in the scheduler.
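To make the lock inversion concrete, here is a minimal Java sketch. The class
names (JobTrackerLike, SchedulerLike) are hypothetical stand-ins, not the
actual Hadoop code; it only illustrates the two call paths and the reordering
the patch applies:
{code}
// Hypothetical stand-ins for the JobTracker and the capacity scheduler.
class JobTrackerLike {
    final SchedulerLike scheduler = new SchedulerLike(this);

    // Heartbeat path: the IPC handler holds the JobTracker monitor, then
    // assignTasks tries to take the scheduler's monitor.
    synchronized void heartbeat() {
        scheduler.assignTasks();
    }

    // TaskTrackerManager method; needs the JobTracker monitor.
    synchronized int getClusterStatus() {
        return 0; // stand-in for real cluster state
    }
}

class SchedulerLike {
    private final JobTrackerLike jt;
    SchedulerLike(JobTrackerLike jt) { this.jt = jt; }

    // Reached from heartbeat() while the JobTracker lock is already held.
    synchronized void assignTasks() { /* update QSI objects, etc. */ }

    // BROKEN: reclaimCapacity takes the scheduler lock, then calls back into
    // the JobTracker -- the opposite order from heartbeat(), hence deadlock.
    synchronized void reclaimCapacityBroken() {
        jt.getClusterStatus();
    }

    // FIXED, matching the patch: read what we need from TaskTrackerManager
    // first, while holding no scheduler lock, then lock scheduler state.
    void reclaimCapacityFixed() {
        int status = jt.getClusterStatus();
        synchronized (this) {
            // update QSI objects using the 'status' snapshot
        }
    }
}
{code}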
I've made the following changes:
* I've moved updateQSIObjects() from TaskSchedulingMgr to CapacityScheduler. We
may as well update both the map and reduce tasks in one go, rather than doing
them separately and walking the list of jobs in a queue twice.
* updateQSI was called in three places: assignTasks (when processing a
heartbeat), the reclaimCapacity thread, and the test cases. In all these
calls, we now get the cluster information from TaskTrackerManager first, then
update the QSI objects (a rough sketch of this ordering follows the list).
* I renamed one of the methods from updateQSIInfo to updateQSIInfoForTests to
better suggest what it does. Hence the minor changes in
TestCapacityScheduler.java.
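A rough sketch of the reordered update path described in the list above. The
names here (TaskTrackerManagerLike, CapacitySchedulerLike, and the slot-count
getters) are stand-ins, not the real API; see 4977.1.patch for the actual
change:
{code}
// Stub interface standing in for TaskTrackerManager (implemented by the JT).
interface TaskTrackerManagerLike {
    int getMaxMapSlots();    // may take the JobTracker lock internally
    int getMaxReduceSlots();
}

class CapacitySchedulerLike {
    private final TaskTrackerManagerLike ttm;
    CapacitySchedulerLike(TaskTrackerManagerLike ttm) { this.ttm = ttm; }

    // assignTasks, the reclaimCapacity thread, and the tests all follow the
    // same order now: fetch cluster info first, then lock and update QSIs.
    void updateQSIObjects() {
        int mapSlots = ttm.getMaxMapSlots();       // no scheduler lock held
        int reduceSlots = ttm.getMaxReduceSlots(); // while calling the JT
        synchronized (this) {
            // walk each queue's job list once, updating map and reduce
            // QSI objects together from mapSlots/reduceSlots
        }
    }
}
{code}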
> Deadlock between reclaimCapacity and assignTasks
> ------------------------------------------------
>
> Key: HADOOP-4977
> URL: https://issues.apache.org/jira/browse/HADOOP-4977
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/capacity-sched
> Affects Versions: 0.19.0
> Reporter: Matei Zaharia
> Assignee: Amar Kamat
> Priority: Blocker
> Fix For: 0.20.0
>
> Attachments: 4977.1.patch, jstack.txt
>
>
> I was running the latest trunk with the capacity scheduler and saw the
> JobTracker lock up with the following deadlock reported in jstack:
> Found one Java-level deadlock:
> =============================
> "18107...@qtp0-4":
>   waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
>   which is held by "IPC Server handler 4 on 54311"
> "IPC Server handler 4 on 54311":
>   waiting to lock monitor 0x0808594c (object 0x5660e518, a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr),
>   which is held by "reclaimCapacity"
> "reclaimCapacity":
>   waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
>   which is held by "IPC Server handler 4 on 54311"
> Java stack information for the threads listed above:
> ===================================================
> "18107...@qtp0-4":
>   at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:2695)
>   - waiting to lock <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
>   at org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:93)
>   at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:324)
>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
>   at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>   at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> "IPC Server handler 4 on 54311":
>   at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
>   - waiting to lock <0x5660e518> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
>   at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:855)
>   at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$1000(CapacityTaskScheduler.java:294)
>   at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1336)
>   - locked <0x5660dd20> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
>   at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2288)
>   - locked <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
>   at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> Unfortunately I accidentally failed to select all of the output, so some
> is missing, but it appears that reclaimCapacity locks the MapSchedulingMgr
> and then tries to lock the JobTracker, whereas updateQSIObjects, called from
> assignTasks, holds a lock on the JobTracker (the JobTracker grabs this lock
> when it calls assignTasks) and then tries to lock the MapSchedulingMgr. The
> other thread listed there is a Jetty thread for the web interface and isn't
> part of the circular wait. The solution to this would be to lock the
> JobTracker in reclaimCapacity before locking anything else, as sketched below.
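A minimal sketch of the lock ordering proposed in the description above, with
stand-in lock objects rather than the real classes: take the JobTracker
monitor before any scheduler monitor, so the reclaimCapacity thread and the
heartbeat path acquire the two locks in the same global order and cannot
deadlock.
{code}
// Hypothetical stand-in for the reclaimCapacity thread's body.
class ReclaimCapacityTask implements Runnable {
    private final Object jobTrackerLock;    // stands in for the JobTracker
    private final Object schedulingMgrLock; // stands in for MapSchedulingMgr

    ReclaimCapacityTask(Object jt, Object mgr) {
        jobTrackerLock = jt;
        schedulingMgrLock = mgr;
    }

    public void run() {
        synchronized (jobTrackerLock) {        // outer: matches heartbeat
            synchronized (schedulingMgrLock) { // inner: matches assignTasks
                // reclaim capacity here under a consistent lock order
            }
        }
    }
}
{code}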
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.