[
https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491781#comment-14491781
]
Rohini Palaniswamy edited comment on TEZ-2300 at 4/12/15 11:45 PM:
-------------------------------------------------------------------
This one was a case of the AM lingering for a long time due to posting to ATS. [~jlowe] had
configured tez.yarn.ats.max.events.per.batch to 500 in tez-site.xml, as we had
earlier noticed that even on a normal shutdown the AM lingered for a long time
when there were a lot of events. But this case is even worse, taking more than
5 mins. I believe it is due to ATS put errors and retries.
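For reference, a minimal sketch of setting that batching property (the key is the one named above; setting it on the client-side Configuration should be equivalent to the tez-site.xml entry, and the class name here is just for illustration):
{code}
import org.apache.hadoop.conf.Configuration;

public class AtsBatchConfigExample {
    public static void main(String[] args) {
        // Same effect as the tez-site.xml entry used here:
        //   <property>
        //     <name>tez.yarn.ats.max.events.per.batch</name>
        //     <value>500</value>
        //   </property>
        Configuration conf = new Configuration();
        conf.setInt("tez.yarn.ats.max.events.per.batch", 500);
        System.out.println("batch size = " + conf.getInt("tez.yarn.ats.max.events.per.batch", -1));
    }
}
{code}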
Pig client log:
{code}
2015-04-12 23:24:42,040 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 51890 Succeeded: 0 Running: 50 Failed: 0 Killed: 0, diagnostics=, counters=null
{code}
Did Ctrl+C here on the pig client.
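For context, Pig stops the Tez session from a JVM shutdown hook on Ctrl+C (per the issue description, Pig calls TezClient.stop() in a shutdown hook). A minimal sketch of that pattern, not Pig's actual code:
{code}
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.TezConfiguration;

public class PigStyleShutdownHook {
    public static void main(String[] args) throws Exception {
        TezConfiguration tezConf = new TezConfiguration();
        final TezClient tezClient = TezClient.create("example-session", tezConf);
        tezClient.start();

        // On Ctrl+C the JVM runs shutdown hooks; stopping the client here is what
        // kicks off the AM shutdown (and the ATS flush) seen in the logs below.
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    tezClient.stop();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }));

        // ... submit DAGs via tezClient.submitDAG(...) here ...
    }
}
{code}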
AM DAG log:
{code}
2015-04-12 23:24:52,015 INFO [AsyncDispatcher event handler] app.DAGAppMaster: DAG completed, dagId=dag_1428329756093_325099_1, dagState=KILLED
2015-04-12 23:24:52,015 INFO [AsyncDispatcher event handler] common.TezUtilsInternal: Redirecting log file based on addend: dag_1428329756093_325099_1_post
{code}
In the dag_1428329756093_325099_1_post log (also attached to the jira):
{code}
2015-04-12 23:24:57,029 INFO [AMShutdownThread] ats.ATSHistoryLoggingService: Stopping ATSService, eventQueueBacklog=17927
2015-04-12 23:25:25,466 WARN [AMShutdownThread] ats.ATSHistoryLoggingService: ATSService being stopped, eventQueueBacklog=17927, maxTimeLeftToFlush=-1, waitForever=true
.... Lot of ATS put errors ....
2015-04-12 23:32:53,197 INFO [AMShutdownThread] ats.ATSHistoryLoggingService: Event queue empty, stopping ATS Service
2015-04-12 23:32:53,200 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted
2015-04-12 23:32:53,203 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: Unregistering application from RM, exitStatus=SUCCEEDED, exitMessage=Session stats:submittedDAGs=1, successfulDAGs=0, failedDAGs=0, killedDAGs=1, trackingURL=bassniumtan-jt1.tan.ygrid.yahoo.com:4080/tez/#/?appid=application_1428329756093_325099
2015-04-12 23:32:53,210 INFO [AMShutdownThread] impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
2015-04-12 23:32:53,314 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: Successfully unregistered application from RM
2015-04-12 23:32:53,315 INFO [AMRM Callback Handler Thread] impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
2015-04-12 23:32:53,316 INFO [AMShutdownThread] ipc.Server: Stopping server on 51921
2015-04-12 23:32:53,319 INFO [IPC Server listener on 51921] ipc.Server: Stopping IPC Server listener on 51921
2015-04-12 23:32:53,319 INFO [AMShutdownThread] ipc.Server: Stopping server on 50500
2015-04-12 23:32:53,320 INFO [IPC Server listener on 50500] ipc.Server: Stopping IPC Server listener on 50500
2015-04-12 23:32:53,320 INFO [IPC Server Responder] ipc.Server: Stopping IPC Server Responder
2015-04-12 23:32:53,320 INFO [IPC Server Responder] ipc.Server: Stopping IPC Server Responder
2015-04-12 23:32:53,324 INFO [AMShutdownThread] app.DAGAppMaster: Completed deletion of tez scratch data dir, path=hdfs://bassniumtan-nn1.tan.ygrid.yahoo.com:8020/tmp/temp-1464028011/.tez/application_1428329756093_325099
2015-04-12 23:32:53,324 INFO [AMShutdownThread] app.DAGAppMaster: Exiting DAGAppMaster..GoodBye!
2015-04-12 23:32:53,325 INFO [Thread-1] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
{code}
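The relevant behavior here is that on stop, with waitForever=true, the logging service keeps draining the event queue until it is empty, no matter how long each ATS put takes. A rough sketch of that kind of drain-on-stop pattern (illustrative only, not the actual ATSHistoryLoggingService code):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class DrainOnStopSketch {
    private final LinkedBlockingQueue<Object> eventQueue = new LinkedBlockingQueue<Object>();
    private final int maxEventsPerBatch = 500;   // tez.yarn.ats.max.events.per.batch
    private final boolean waitForever = true;    // as seen in the log above

    public void serviceStop() {
        System.out.println("Stopping ATSService, eventQueueBacklog=" + eventQueue.size());
        // With waitForever=true the shutdown thread blocks here until the queue is
        // empty, so every slow or retried ATS put directly extends AM shutdown time.
        while (waitForever && !eventQueue.isEmpty()) {
            List<Object> batch = new ArrayList<Object>();
            eventQueue.drainTo(batch, maxEventsPerBatch);
            postToTimelineServer(batch);   // may retry internally for a long time
        }
        System.out.println("Event queue empty, stopping ATS Service");
    }

    private void postToTimelineServer(List<Object> batch) {
        // placeholder for TimelineClient.putEntities(...), which retries on errors
    }
}
{code}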
Jstack shows the AMShutdownThread still running on the AM:
{code}
"AMShutdownThread" prio=10 tid=0x0f3ad800 nid=0x2b0e runnable [0x04e40000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    - locked <0xde56b700> (a java.io.BufferedInputStream)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:633)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:579)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1322)
    - locked <0xde42d320> (a sun.net.www.protocol.http.HttpURLConnection)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:230)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:166)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:241)
    at com.sun.jersey.api.client.Client.handle(Client.java:648)
    at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
    at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
    at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:461)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:319)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:316)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:316)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
    at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:304)
    at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:199)
    - locked <0x1481e8a0> (a java.lang.Object)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0x1481e7d0> (a java.lang.Object)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:80)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0x147fd580> (a java.lang.Object)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1535)
    at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1674)
    - locked <0x145e1238> (a org.apache.tez.dag.app.DAGAppMaster)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0x145e1378> (a java.lang.Object)
    at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:724)
    at java.lang.Thread.run(Thread.java:722)
{code}
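The stack shows the shutdown thread sitting inside TimelineClientImpl's connection retry filter. A simplified sketch of that style of retry-on-error loop, to show why each failing put can block for a long while (the constants and names here are illustrative, not taken from the ticket; the real limits come from the YARN timeline client settings such as yarn.timeline-service.client.max-retries and retry-interval-ms):
{code}
public class RetryOnErrorSketch {

    // Illustrative values only.
    private static final int MAX_RETRIES = 30;
    private static final long RETRY_INTERVAL_MS = 1000;

    interface Op<T> { T run() throws Exception; }

    static <T> T retryOn(Op<T> op) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return op.run();   // e.g. one HTTP POST of a batch of entities to ATS
            } catch (Exception e) {
                attempt++;
                if (attempt > MAX_RETRIES) {
                    throw e;       // give up on this batch
                }
                // A single failing put can block up to MAX_RETRIES * RETRY_INTERVAL_MS
                // (~30s with these numbers) before the next batch is even attempted.
                Thread.sleep(RETRY_INTERVAL_MS);
            }
        }
    }
}
{code}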
The ATS UI shows the DAG start and end times as 12 Apr 2015 16:24:20 and 12 Apr 2015
16:24:51 and the status as KILLED. But the AM started at 12-Apr-2015
23:24:05 and only finished after 8 mins 47 sec. Not sure why there are so many
ATS put errors; it is those retries that are taking up so much time.
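A rough back-of-the-envelope, assuming a retry loop like the sketch above: 17927 backlogged events at 500 events per batch is roughly 36 puts, so if even a fraction of those puts hit errors and each one retries for tens of seconds before succeeding or giving up, the drain easily stretches to the ~8 minutes seen here.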
> TezClient.stop() takes a lot of time or does not work sometimes
> ---------------------------------------------------------------
>
> Key: TEZ-2300
> URL: https://issues.apache.org/jira/browse/TEZ-2300
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Attachments: syslog_dag_1428329756093_325099_1_post
>
>
> Noticed this with a couple of pig scripts which were not behaving well (AM
> close to OOM, etc.) and even with some that were running fine. Pig calls
> TezClient.stop() in a shutdown hook. Ctrl+C to the pig script either exits
> immediately or hangs. In both cases it takes a long time for the yarn
> application to go to the KILLED state. Many times I just end up calling yarn
> application -kill separately after waiting for 5 mins or more for it to get
> killed.