If steps 6 and 7 happen before the Tez AM shuts down, then the AM will not exit for a long time. This is because shutting down the mini cluster shuts down the NM but may not shut down the AM. After cleaning up, the AM tries to unregister from the RM before shutting itself down. Since the RM is already gone, the unregister call keeps retrying for some time (this is for the high-availability case, where the RM may have just crashed and will come back up). So you will see the AM process hanging around for some time.
You can confirm this by checking whether the NM and RM processes are gone while the AM is still hanging around, and by checking for this message in the AM logs: “Waiting for application to be successful”

Bikas

*From:* Subroto Sanyal [mailto:sanyalsubr...@gmail.com]
*Sent:* Wednesday, June 11, 2014 1:52 AM
*To:* dev@tez.incubator.apache.org
*Subject:* Re: Deadlock in DAGAppMaster during shutdown.

Hi Bikas, Hitesh

TezSession.stop() is invoked as part of my client flow. Order of execution:

1) Create MiniTezCluster
2) Create Tez Session
3) Create DAG
4) Submit DAG to Tez Session and wait for completion
5) Repeat step 4 for different DAGs
6) Stop Tez Session
7) Stop MiniTezCluster

PFA the container logs and thread-dump of DAGAppMaster

On Wed, Jun 11, 2014 at 1:23 AM, Bikas Saha <bi...@hortonworks.com> wrote:

Can you please clarify whether the TezSession is stopped? Has TezSession.stop() been called? If not, then the session app on the cluster will not stop. It will stop after it has been idle (no DAG running) for a configurable timeout period.

If TezSession.stop() has been called, then the AM might keep running while it cleans up existing running tasks etc., and exit when this cleanup is done. TezSession.stop() is not blocking on the client, so the method can return before the app exits.

Bikas

-----Original Message-----
From: Subroto Sanyal [mailto:sanyalsubr...@gmail.com]
Sent: Tuesday, June 10, 2014 3:27 PM
To: dev@tez.incubator.apache.org
Subject: Re: Deadlock in DAGAppMaster during shutdown.

Hi,

I have built the Tez jars from the git repository today; still, I see the DAGAppMaster running even after the TezSession is stopped. Do I need to get the code/jar from somewhere else to get the fix reflected?

On Tue, Jun 10, 2014 at 1:54 PM, Subroto Sanyal <sanyalsubr...@gmail.com> wrote:

> Hi Oleg,
>
> Thanks for confirming. Could you please provide the TEZ JIRA tickets
> for both of the issues where they have been solved?
> I couldn't find the code changes for closing TezClient.
>
> On Tue, Jun 10, 2014 at 1:25 PM, Oleg Zhurakousky <ozhurakou...@hortonworks.com> wrote:
>
>> Subroto
>>
>> Thanks for pointing this out.
>> This and the TezClient issue you’ve pointed out in your previous
>> email are actually being actively addressed.
>>
>> Oleg
>>
>> On Jun 10, 2014, at 5:42 AM, Subroto Sanyal <sanyalsubr...@gmail.com> wrote:
>>
>> > In the class AMRMClientAsyncImpl, the object (7c3041e28) is locked by the
>> > heartbeat thread (which runs a kind of infinite loop, like any heartbeat
>> > thread), and the method unregisterApplicationMaster is waiting to lock
>> > the same object.
>> >
>> > Only once unregisterApplicationMaster has locked that object can it
>> > notify the heartbeat thread to exit, via the boolean flag keepRunning.
>> >
>> > Following is the thread-dump for the deadlock:
>> >
>> > "AMShutdownThread" daemon prio=5 tid=7f9a02921800 nid=0x115d68000 waiting for monitor entry [115d67000]
>> >    java.lang.Thread.State: BLOCKED (on object monitor)
>> >     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:156)
>> >     - waiting to lock <7c3041e28> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.rm.TaskScheduler.serviceStop(TaskScheduler.java:394)
>> >     - locked <7c3006aa0> (a org.apache.tez.dag.app.rm.TaskScheduler)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c3038008> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskSchedulerEventHandler.java:357)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c2f71360> (a java.lang.Object)
>> >     at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>> >     at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>> >     at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1518)
>> >     at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1649)
>> >     - locked <7c2f51790> (a org.apache.tez.dag.app.DAGAppMaster)
>> >     at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>> >     - locked <7c2fed728> (a java.lang.Object)
>> >     at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:607)
>> >     at java.lang.Thread.run(Thread.java:695)
>> >
>> > "AMRM Heartbeater thread" prio=5 tid=7f9a0c0e8800 nid=0x111e70000 waiting on condition [111e6f000]
>> >    java.lang.Thread.State: TIMED_WAITING (sleeping)
>> >     at java.lang.Thread.sleep(Native Method)
>> >     at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
>> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:150)
>> >     at com.sun.proxy.$Proxy9.allocate(Unknown Source)
>> >     at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:246)
>> >     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >     - locked <7c3041e28> (a java.lang.Object)
>> >
>> > public void unregisterApplicationMaster(FinalApplicationStatus appStatus,
>> >     String appMessage, String appTrackingUrl) throws YarnException,
>> >     IOException {
>> >   synchronized (unregisterHeartbeatLock) {
>> >     keepRunning = false;
>> >     client.unregisterApplicationMaster(appStatus, appMessage, appTrackingUrl);
>> >   }
>> > }
>> >
>> > The line "keepRunning = false" should be outside the synchronized block.
>> >
>> > I am not sure whether this should be regarded as a problem in YARN or in Tez.
>> > The flag is private and can't be accessed by the Tez implementation,
>> > TezAMRMClientAsync.
>> >
>> > --
>> > Cheers,
>> > *Subroto Sanyal*
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or
>> entity to which it is addressed and may contain information that is
>> confidential, privileged and exempt from disclosure under applicable
>> law. If the reader of this message is not the intended recipient, you
>> are hereby notified that any printing, copying, dissemination,
>> distribution, disclosure or forwarding of this communication is
>> strictly prohibited. If you have received this communication in
>> error, please contact the sender immediately and delete it from your
>> system. Thank You.
>
> --
> Cheers,
> *Subroto Sanyal*

--
Cheers,
*Subroto Sanyal*
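
For readers following the thread, here is a minimal sketch of the change proposed in the quoted message: set the stop flag before taking the lock, not inside it. This is not the Hadoop source. The class name HeartbeatStopSketch, the helper slowRemoteCall, and the sleep durations are hypothetical stand-ins for the heartbeat's allocate() call and for unregisterApplicationMaster(); only the names keepRunning and unregisterHeartbeatLock come from the thread. Making the flag volatile is an assumption of the sketch, needed so that a write made outside the lock is visible to the heartbeat thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatStopSketch {
    private final Object unregisterHeartbeatLock = new Object();
    // volatile (an assumption of this sketch): the flag is now written outside the lock.
    private volatile boolean keepRunning = true;
    private final AtomicInteger beats = new AtomicInteger();
    private final Thread heartbeat = new Thread(this::heartbeatLoop, "heartbeat");

    private void heartbeatLoop() {
        while (keepRunning) {
            synchronized (unregisterHeartbeatLock) {
                if (!keepRunning) {
                    return; // stop was requested while we waited for the lock
                }
                slowRemoteCall(50); // hypothetical stand-in for allocate(); lock is held
                beats.incrementAndGet();
            }
            pause(10); // heartbeat interval, lock released
        }
    }

    public void start() {
        heartbeat.start();
    }

    public void stop() throws InterruptedException {
        // Set the flag first, so the heartbeat loop sees it at its next check
        // even if this thread has not yet acquired the lock.
        keepRunning = false;
        synchronized (unregisterHeartbeatLock) {
            slowRemoteCall(10); // hypothetical stand-in for the unregister call
        }
        heartbeat.join();
    }

    public int beats() {
        return beats.get();
    }

    public boolean heartbeatAlive() {
        return heartbeat.isAlive();
    }

    private static void slowRemoteCall(long millis) {
        pause(millis);
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatStopSketch sketch = new HeartbeatStopSketch();
        sketch.start();
        Thread.sleep(120);
        sketch.stop();
        System.out.println("heartbeat alive after stop: " + sketch.heartbeatAlive());
    }
}
```

Note that stop() still has to wait for any call that is in flight while the lock is held. If that call is itself stuck retrying against an RM that is gone, as in the thread dump above, moving the flag does not by itself shorten that wait; it only ensures the heartbeat loop does not start another iteration.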