[ https://issues.apache.org/jira/browse/TEZ-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174326#comment-14174326 ]
Hitesh Shah commented on TEZ-1643:
----------------------------------
bq. What will happen to the running tasks? When the AM restarts, will the RM
resync it with running containers or will it kill existing containers?
That depends on how recovery handles work-preserving restart. At this point, I
believe the recovery code does not support work-preserving restarts, and I am
assuming the RM will kill all of the previous attempt's containers.
Will take a look at implementing the test along the lines suggested.
> DAGAppMaster kills DAG & shuts down, when RM is restarted
> ---------------------------------------------------------
>
> Key: TEZ-1643
> URL: https://issues.apache.org/jira/browse/TEZ-1643
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Hitesh Shah
> Attachments: TEZ-1643.3.patch, TEZ-1643.4.patch,
> TEZ-1643.wip.2.patch, TEZ-1643.wip.patch
>
>
> Scenario:
> 1. Start a long running job
> 2. Kill RM (recovery is enabled in RM. No RM-HA configured)
> 3. AMRMClientAsyncImpl$HeartbeatThread throws error (EOFException) which
> internally causes the appmaster to kill DAG.
> 2014-10-08 02:24:06,705 INFO [IPC Server handler 6 on 55291] org.apache.tez.dag.app.dag.impl.TaskImpl: TaskAttempt:attempt_1412734988643_0001_1_00_000000_0 sent events: (0-1)
> 2014-10-08 02:24:12,255 ERROR [AMRM Heartbeater thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Exception on heartbeat
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,256 INFO [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
> java.lang.InterruptedException
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
> 2014-10-08 02:24:12,257 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Stopping callback due to:
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,257 INFO [TaskSchedulerAppCaller #0] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Error reported by scheduler
> 2014-10-08 02:24:12,258 INFO [AsyncDispatcher event handler] org.apache.tez.common.TezUtilsInternal: Redirecting log file based on addend: dag_1412734988643_0001_1_post
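
For illustration only: in the scenario above, a single IOException on the heartbeat ends with the DAG being killed. The sketch below contrasts that with a heartbeat loop that tolerates a bounded number of consecutive failures before giving up. It is plain Java and uses no YARN or Tez classes. HeartbeatCall, runHeartbeats, maxConsecutiveFailures and backoffMillis are hypothetical names chosen for this example, and this is not the change made in the TEZ-1643 patches.

```java
import java.io.EOFException;
import java.io.IOException;

public class HeartbeatRetrySketch {

    /** Hypothetical stand-in for one heartbeat RPC to the RM. */
    @FunctionalInterface
    public interface HeartbeatCall {
        void beat() throws IOException;
    }

    /**
     * Attempts up to totalBeats heartbeats.
     * Returns false once maxConsecutiveFailures IOExceptions occur in a row
     * (or the thread is interrupted while backing off); returns true otherwise.
     */
    public static boolean runHeartbeats(HeartbeatCall call, int totalBeats,
                                        int maxConsecutiveFailures, long backoffMillis) {
        int consecutiveFailures = 0;
        for (int i = 0; i < totalBeats; i++) {
            try {
                call.beat();
                consecutiveFailures = 0; // the RM answered, so reset the failure count
            } catch (IOException e) {
                consecutiveFailures++;
                if (consecutiveFailures >= maxConsecutiveFailures) {
                    return false; // failure budget exhausted; the caller decides what to do
                }
                try {
                    Thread.sleep(backoffMillis); // wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt status
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated RM outage: the 3rd and 4th heartbeats fail with EOFException.
        HeartbeatCall flaky = () -> {
            int n = ++calls[0];
            if (n == 3 || n == 4) {
                throw new EOFException("simulated RM restart");
            }
        };
        boolean survived = runHeartbeats(flaky, 6, 3, 10L);
        System.out.println("survived=" + survived + " calls=" + calls[0]);
    }
}
```

Whether surviving the outage is useful still depends on the point made in the comment above: if the RM kills the previous attempt's containers on restart, the AM would have to re-run that work regardless.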