[
https://issues.apache.org/jira/browse/HADOOP-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661478#action_12661478
]
Amar Kamat commented on HADOOP-4977:
------------------------------------
Identified two scenarios where the deadlock can happen:
1) The {{ReclaimCapacity}} thread calls {{TaskSchedulingMgr.reclaimCapacity()}}, which calls {{updateQSIObjects()}}, which in turn calls {{TaskTrackerManager.getClusterStatus()}}, a _synchronized_ method.
2) The {{ReclaimCapacity}} thread calls {{TaskSchedulingMgr.reclaimCapacity()}}, which calls {{TaskTrackerManager.getNextHeartbeatInterval()}}, which in turn calls {{TaskTrackerManager.getClusterStatus()}}, a _synchronized_ method.
Note that this can happen any time a thread from the capacity scheduler calls back into the {{TaskTrackerManager}} and tries to take its lock, while the {{TaskTrackerManager}} itself invokes the {{TaskScheduler}}'s API after locking itself. The whole deadlock can be summarized as follows (a minimal sketch of the inverted lock order appears after the table):
||Who||Via||Locks?||Needs to lock?||Via||
|JobTracker|JobTracker.heartbeat()|JobTracker (itself)|TaskSchedulingMgr|CapacityTaskScheduler.assignTasks() calls TaskSchedulingMgr.assignTasks(), which calls updateQSIObjects(), which is a _synchronized_ call|
|CapacityScheduler.ReclaimCapacityThread|TaskSchedulingMgr.reclaimCapacity()|TaskSchedulingMgr|JobTracker|TaskSchedulingMgr.reclaimCapacity() calls TaskSchedulingMgr.updateQSIObjects(), which calls JobTracker.getClusterStatus(), a _synchronized_ call; reclaimCapacity() also calls JobTracker.getNextHeartbeatInterval(), another _synchronized_ call|
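To make the inverted ordering concrete, here is a minimal, self-contained sketch of the two paths. Class bodies are trimmed and signatures simplified; the real classes carry far more state, and the names are reused only for readability, so this is an illustration rather than the actual scheduler code:
{code:java}
// Heartbeat path: JobTracker lock first, then the TaskSchedulingMgr lock.
class JobTracker {
  final TaskSchedulingMgr scheduler = new TaskSchedulingMgr(this);

  synchronized void heartbeat() {
    scheduler.assignTasks();               // will need the TaskSchedulingMgr lock
  }

  synchronized String getClusterStatus() {
    return "cluster status";               // stands in for ClusterStatus
  }

  synchronized int getNextHeartbeatInterval() {
    return 3000;
  }
}

class TaskSchedulingMgr {
  private final JobTracker jobTracker;

  TaskSchedulingMgr(JobTracker jt) { this.jobTracker = jt; }

  void assignTasks() {
    updateQSIObjects();                    // acquires the TaskSchedulingMgr lock
  }

  synchronized void updateQSIObjects() {
    jobTracker.getClusterStatus();         // harmless here: heartbeat already holds the JobTracker lock
  }

  // Reclaim path: TaskSchedulingMgr lock first, then a call back into the
  // synchronized JobTracker -- the reverse order. If a heartbeat is in flight
  // at the same moment, each thread waits on the lock the other one holds.
  synchronized void reclaimCapacity() {
    jobTracker.getClusterStatus();
    jobTracker.getNextHeartbeatInterval();
  }
}
{code}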
Spoke to Hemanth and Vivek about this, and we all agree that the cluster parameters (cluster-status, heartbeat-interval) should be obtained and cached before invoking {{reclaimCapacity()}}, and that this should be a standing rule whenever new scheduler-side threads are added. A sketch of what that could look like is below.
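The following is only a sketch of that rule under simplified, stand-in types (the stub names and the exact signature of {{reclaimCapacity()}} are illustrative, not the final patch): the reclaim thread snapshots whatever it needs from the {{TaskTrackerManager}} before any scheduler lock is taken, so it never calls back into a _synchronized_ JobTracker method while holding its own monitor.
{code:java}
// Illustrative value holder for the cached cluster parameters.
class ClusterSnapshot {
  final int taskTrackers;
  final int nextHeartbeatInterval;
  ClusterSnapshot(int taskTrackers, int nextHeartbeatInterval) {
    this.taskTrackers = taskTrackers;
    this.nextHeartbeatInterval = nextHeartbeatInterval;
  }
}

// Stand-in for the JobTracker / TaskTrackerManager.
class TaskTrackerManagerStub {
  synchronized int getNumTaskTrackers()       { return 10; }
  synchronized int getNextHeartbeatInterval() { return 3000; }
}

class SchedulingMgrStub {
  // reclaimCapacity() now works only on the values handed to it and never
  // calls back into the TaskTrackerManager from inside this lock.
  synchronized void reclaimCapacity(ClusterSnapshot snapshot) {
    // ... recompute queue capacities using snapshot.taskTrackers ...
  }
}

class ReclaimCapacityDriver {
  static void runOnce(TaskTrackerManagerStub ttm, SchedulingMgrStub mgr) {
    // 1. Cache the cluster parameters with no scheduler lock held.
    ClusterSnapshot snapshot =
        new ClusterSnapshot(ttm.getNumTaskTrackers(), ttm.getNextHeartbeatInterval());
    // 2. Only then take the scheduler lock.
    mgr.reclaimCapacity(snapshot);
  }
}
{code}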
> Deadlock between reclaimCapacity and assignTasks
> ------------------------------------------------
>
> Key: HADOOP-4977
> URL: https://issues.apache.org/jira/browse/HADOOP-4977
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/capacity-sched
> Affects Versions: 0.19.0
> Reporter: Matei Zaharia
> Assignee: Amar Kamat
> Priority: Blocker
> Fix For: 0.20.0
>
> Attachments: jstack.txt
>
>
> I was running the latest trunk with the capacity scheduler and saw the
> JobTracker lock up with the following deadlock reported in jstack:
> Found one Java-level deadlock:
> =============================
> "18107...@qtp0-4":
> waiting to lock monitor 0x08085b40 (object 0x56605100, a
> org.apache.hadoop.mapred.JobTracker),
> which is held by "IPC Server handler 4 on 54311"
> "IPC Server handler 4 on 54311":
> waiting to lock monitor 0x0808594c (object 0x5660e518, a
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr),
> which is held by "reclaimCapacity"
> "reclaimCapacity":
> waiting to lock monitor 0x08085b40 (object 0x56605100, a
> org.apache.hadoop.mapred.JobTracker),
> which is held by "IPC Server handler 4 on 54311"
> Java stack information for the threads listed above:
> ===================================================
> "18107...@qtp0-4":
> at
> org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:2695)
> - waiting to lock <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
> at
> org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:93)
> at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
> at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
> at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
> at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:324)
> at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
> at
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
> at
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
> at
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> "IPC Server handler 4 on 54311":
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
> - waiting to lock <0x5660e518> (a
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:855)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$1000(CapacityTaskScheduler.java:294)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1336)
> - locked <0x5660dd20> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
> at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2288)
> - locked <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
> at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> Unfortunately I didn't select all of the output, so some of it is missing, but it appears that reclaimCapacity locks the MapSchedulingMgr and then tries to lock the JobTracker, whereas updateQSIObjects, called from assignTasks, holds a lock on the JobTracker (the JobTracker grabs this lock when it calls assignTasks) and then tries to lock the MapSchedulingMgr. The other thread listed there is a Jetty thread for the web interface and isn't part of the circular locking. The solution to this would be to lock the JobTracker in reclaimCapacity before locking anything else; a sketch of that ordering follows.
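For reference, here is a sketch of the lock-ordering alternative suggested in the original description, again with simplified stand-in types (the stub names are illustrative and this is not the committed change): {{reclaimCapacity()}} takes the JobTracker monitor before its own, matching the order used by the heartbeat path so the cycle cannot form.
{code:java}
class JobTrackerStub {
  synchronized String getClusterStatus() { return "status"; }
}

class SchedulingMgrWithOrdering {
  private final JobTrackerStub jobTracker;

  SchedulingMgrWithOrdering(JobTrackerStub jt) { this.jobTracker = jt; }

  void reclaimCapacity() {
    // Acquire the JobTracker lock before this object's own monitor,
    // mirroring JobTracker.heartbeat() -> assignTasks().
    synchronized (jobTracker) {
      synchronized (this) {
        String status = jobTracker.getClusterStatus();  // safe: JobTracker lock already held
        // ... recompute how much capacity to reclaim using status ...
      }
    }
  }
}
{code}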