[
https://issues.apache.org/jira/browse/SLIDER-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098836#comment-15098836
]
Sergey Shelukhin commented on SLIDER-1052:
------------------------------------------
Hmm... I don't see how a deadlock can be deliberate. Unless jstack's deadlock
detection is broken, this is a real deadlock, i.e. these threads would never
proceed, right?
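
To make the reasoning concrete, here is a minimal sketch of the wait-for graph implied by the jstack report quoted below. The two edges are taken from that report; the class and method names (WaitForGraphSketch, reportedEdges, isInCycle) are hypothetical and only illustrate the argument. This is not how jstack itself is implemented.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class WaitForGraphSketch {

    /** Maps each blocked thread to the thread holding the monitor it wants. */
    static Map<String, String> reportedEdges() {
        Map<String, String> waitsFor = new LinkedHashMap<>();
        // Edges copied from the "Found one Java-level deadlock" report.
        waitsFor.put("AMRM Callback Handler Thread", "main"); // wants the AppState monitor
        waitsFor.put("main", "AMRM Callback Handler Thread"); // wants the SliderAppMaster monitor
        return waitsFor;
    }

    /** Returns true if following the wait-for edges from start leads back to start. */
    static boolean isInCycle(Map<String, String> waitsFor, String start) {
        Set<String> seen = new LinkedHashSet<>();
        String current = waitsFor.get(start);
        while (current != null && seen.add(current)) {
            if (current.equals(start)) {
                return true;
            }
            current = waitsFor.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> waitsFor = reportedEdges();
        for (String thread : waitsFor.keySet()) {
            System.out.println(thread + " in cycle: " + isInCycle(waitsFor, thread));
        }
    }
}
```

Both threads sit on a cycle, and a thread blocked on entering a monitor cannot be interrupted or timed out, so neither can release the monitor the other needs.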
> Deadlock in slider AM
> ---------------------
>
> Key: SLIDER-1052
> URL: https://issues.apache.org/jira/browse/SLIDER-1052
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.80
> Reporter: Sergey Shelukhin
> Priority: Critical
>
> I have a hung slider AM in the following state.
> The first app attempt failed to start, so this is the 2nd one. -However, the
> 1st app attempt process is still running on the same machine, and it is in a
> state where I cannot jstack it even with -F. I will kill it shortly and see
> what happens. YARN thinks it's killed.- Never mind, it was some other process.
> The first container was on a different machine and did die.
> The 2nd attempt received the container death notification for the first one:
> {noformat}
> 2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - Container Completion for
> containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE,
> exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Note that this is from the logs of the 2nd container
> (container_e02_1450721565699_0007_02_000001). The jstack for the 2nd
> attempt shows the deadlock:
> {noformat}
> Found one Java-level deadlock:
> =============================
> "AMRM Callback Handler Thread":
>   waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
>   which is held by "main"
> "main":
>   waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
>   which is held by "AMRM Callback Handler Thread"
> {noformat}
> The jstack was taken with -F, so I cannot see thread names in the dump, but
> these look like the two threads involved (not sure about the first one):
> {noformat}
> Thread 11054: (state = BLOCKED)
> - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
> - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
> - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
> ...
> Thread 10254: (state = BLOCKED)
> - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
> - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
> - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
> - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
> - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
> - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
> - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
> - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
> - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)
> {noformat}
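
The two traces above are consistent with a lock-order inversion. The callback thread appears to hold the SliderAppMaster monitor while entering AppState.onCompletedNode, and main appears to hold the AppState monitor while entering AbstractService.getConfig() on the app master. Where exactly main acquired the AppState monitor is not visible in the dump, so that part is an assumption. A minimal sketch of the same pattern follows, using plain Object locks as hypothetical stand-ins for the two classes and the JVM's own ThreadMXBean detector to confirm the cycle. All class, method, and thread names in it are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderInversionSketch {

    /** Starts two daemon threads that take two monitors in opposite order. */
    static void startInvertedThreads() {
        Object appMaster = new Object(); // hypothetical stand-in for SliderAppMaster
        Object appState = new Object();  // hypothetical stand-in for AppState
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        Thread callback = new Thread(
                () -> lockInOrder(appMaster, appState, bothHoldFirst), "callback-sketch");
        Thread main = new Thread(
                () -> lockInOrder(appState, appMaster, bothHoldFirst), "main-sketch");
        // Daemon threads, so the stuck pair does not keep the JVM alive.
        callback.setDaemon(true);
        main.setDaemon(true);
        callback.start();
        main.start();
    }

    static void lockInOrder(Object first, Object second, CountDownLatch bothHoldFirst) {
        synchronized (first) {
            bothHoldFirst.countDown();
            try {
                bothHoldFirst.await(); // wait until the other thread holds its first monitor
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: the other thread holds 'second' and wants 'first'.
            }
        }
    }

    /** Polls the JVM monitor-deadlock detector; returns how many threads it reports. */
    static int countDeadlockedThreads(long timeoutMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                for (ThreadInfo info : threads.getThreadInfo(ids)) {
                    if (info != null) {
                        System.out.println(info.getThreadName() + " waits for "
                                + info.getLockName() + " held by " + info.getLockOwnerName());
                    }
                }
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        startInvertedThreads();
        System.out.println("deadlocked threads: " + countDeadlockedThreads(5000));
    }
}
```

In this sketch, having both threads take the two monitors in the same order removes the cycle. Whether that is practical in SliderAppMaster depends on code not shown in this issue.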
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)