[ https://issues.apache.org/jira/browse/TEZ-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174322#comment-14174322 ]
Bikas Saha commented on TEZ-1643:
---------------------------------
The event name could be more generic than AMRMClientError, e.g.
SchedulingServiceShutdown.
What will happen to the running tasks? When the AM restarts, will the RM resync
it with the running containers, or will it kill the existing containers?
For the test, we could start the MockDAGAppMaster with a single-vertex DAG,
wait for all tasks to be "launched" by the MockLauncher, and then have the test
code send the SchedulerShutdown event. The test would then check that the
DAGAppMaster has shut down but the DAG is still in the RUNNING state.
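The test flow described above could look roughly like the sketch below. It is only a self-contained toy model and not the Tez test harness: ToyAppMaster, ToyLauncher, onSchedulerShutdown and runScenario are hypothetical stand-ins for MockDAGAppMaster, MockLauncher and the proposed event, and the "shut down the AM without killing the DAG" behavior is the assumed desired outcome, not the current one.

```java
import java.util.ArrayList;
import java.util.List;

public class SchedulerShutdownSketch {
    enum DagState { NEW, RUNNING, KILLED }
    enum AmState { RUNNING, SHUTDOWN }

    // Hypothetical stand-in for MockLauncher: it only records "launched" tasks.
    static class ToyLauncher {
        final List<Integer> launched = new ArrayList<>();
        void launch(int taskIndex) { launched.add(taskIndex); }
    }

    // Hypothetical stand-in for the AM; this is not the real DAGAppMaster API.
    static class ToyAppMaster {
        DagState dagState = DagState.NEW;
        AmState amState = AmState.RUNNING;

        // Start a single-vertex DAG and hand every task to the launcher.
        void runSingleVertexDag(int numTasks, ToyLauncher launcher) {
            dagState = DagState.RUNNING;
            for (int i = 0; i < numTasks; i++) {
                launcher.launch(i);
            }
        }

        // Assumed desired behavior: stop the AM but leave the DAG state untouched.
        void onSchedulerShutdown() {
            amState = AmState.SHUTDOWN;
        }
    }

    // Runs the scenario end to end and returns the AM for inspection.
    static ToyAppMaster runScenario(int numTasks) {
        ToyLauncher launcher = new ToyLauncher();
        ToyAppMaster am = new ToyAppMaster();
        am.runSingleVertexDag(numTasks, launcher);
        if (launcher.launched.size() != numTasks) {
            throw new IllegalStateException("not all tasks were launched");
        }
        am.onSchedulerShutdown(); // the test code sends the shutdown event
        return am;
    }

    public static void main(String[] args) {
        ToyAppMaster am = runScenario(3);
        System.out.println("AM state:  " + am.amState);
        System.out.println("DAG state: " + am.dagState);
    }
}
```

In the real test the two final checks would be made against the MockDAGAppMaster and the DAG it is running; the toy classes only show the order of the steps and the two assertions.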
> DAGAppMaster kills DAG & shuts down, when RM is restarted
> ---------------------------------------------------------
>
> Key: TEZ-1643
> URL: https://issues.apache.org/jira/browse/TEZ-1643
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Hitesh Shah
> Attachments: TEZ-1643.3.patch, TEZ-1643.4.patch,
> TEZ-1643.wip.2.patch, TEZ-1643.wip.patch
>
>
> Scenario:
> 1. Start a long-running job.
> 2. Kill the RM (recovery is enabled in the RM; no RM-HA is configured).
> 3. AMRMClientAsyncImpl$HeartbeatThread throws an error (EOFException), which
> internally causes the AppMaster to kill the DAG.
> 2014-10-08 02:24:06,705 INFO [IPC Server handler 6 on 55291] org.apache.tez.dag.app.dag.impl.TaskImpl: TaskAttempt:attempt_1412734988643_0001_1_00_000000_0 sent events: (0-1)
> 2014-10-08 02:24:12,255 ERROR [AMRM Heartbeater thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Exception on heartbeat
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,256 INFO [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
> java.lang.InterruptedException
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
> 2014-10-08 02:24:12,257 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Stopping callback due to:
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,257 INFO [TaskSchedulerAppCaller #0] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Error reported by scheduler
> 2014-10-08 02:24:12,258 INFO [AsyncDispatcher event handler] org.apache.tez.common.TezUtilsInternal: Redirecting log file based on addend: dag_1412734988643_0001_1_post