[
https://issues.apache.org/jira/browse/TEZ-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164051#comment-14164051
]
Bikas Saha commented on TEZ-1643:
---------------------------------
What can be done about this other than shutting down, if YARN's own AMRMClient
has given up on the RM? Maybe we could add some retries, but nothing would stop
the AMRMClient from failing again. Without RM HA, the client and the RM will
not be able to resync on the allocation quota/status, and the RM (after
restart) will ask for all containers to be killed (including the AM).
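
To make the "add some retries" idea concrete, here is a minimal sketch of a
bounded retry with doubling backoff around a call that can fail with an
IOException. The class name, the helper callWithRetries, and its parameters are
hypothetical and are not part of Tez or YARN; the real heartbeat lives inside
AMRMClientAsyncImpl, so this only illustrates the shape of the idea, and as
noted above it would not stop the call from failing again once the attempts
are used up.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.concurrent.Callable;

public class HeartbeatRetrySketch {

    /**
     * Hypothetical helper: invokes the call up to maxAttempts times, retrying
     * only on IOException and doubling the sleep between attempts.
     * Rethrows the last IOException once the attempts are exhausted.
     */
    static <T> T callWithRetries(Callable<T> call, int maxAttempts, long initialBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a heartbeat that fails twice (RM down) and then succeeds.
        int[] calls = {0};
        String result = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new EOFException("RM unreachable");
            }
            return "allocated";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```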
We could continue to run the job with existing containers instead of failing
the DAG, and hope to finish some (or all) of the work while the RM is
unavailable. Once the RM comes back we will be killed (and restarted).
In HA scenarios the client should wait much longer for the RM to come back up.
So this jira may be a won't-fix for non-HA cases.
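
The options discussed here can be sketched as a small decision function. This
is only an illustration under assumptions: the Action enum, the method name
onRmUnreachable, and the haWaitLimit parameter are hypothetical and do not
correspond to existing Tez or YARN code or configuration.

```java
import java.time.Duration;

public class RmLossPolicySketch {

    /** Hypothetical actions an AM could take when the RM is unreachable. */
    enum Action { WAIT_FOR_RM, RUN_ON_EXISTING_CONTAINERS, SHUT_DOWN }

    /**
     * With RM HA, keep waiting for the RM until a (long) limit is reached.
     * Without RM HA, keep working on containers already held, if any,
     * knowing the restarted RM will kill the AM anyway.
     */
    static Action onRmUnreachable(boolean rmHaEnabled,
                                  Duration outage,
                                  Duration haWaitLimit,
                                  boolean hasRunningContainers) {
        if (rmHaEnabled) {
            return outage.compareTo(haWaitLimit) < 0
                    ? Action.WAIT_FOR_RM
                    : Action.SHUT_DOWN;
        }
        return hasRunningContainers
                ? Action.RUN_ON_EXISTING_CONTAINERS
                : Action.SHUT_DOWN;
    }

    public static void main(String[] args) {
        Duration limit = Duration.ofMinutes(10); // assumed value, for illustration only
        System.out.println(onRmUnreachable(true, Duration.ofSeconds(30), limit, true));
        System.out.println(onRmUnreachable(false, Duration.ofSeconds(30), limit, true));
        System.out.println(onRmUnreachable(false, Duration.ofSeconds(30), limit, false));
    }
}
```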
> DAGAppMaster kills DAG & shuts down, when RM is restarted
> ---------------------------------------------------------
>
> Key: TEZ-1643
> URL: https://issues.apache.org/jira/browse/TEZ-1643
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Critical
>
> Scenario:
> 1. Start a long running job
> 2. Kill RM (recovery is enabled in RM. No RM-HA configured)
> 3. AMRMClientAsyncImpl$HeartbeatThread throws error (EOFException) which
> internally causes the appmaster to kill DAG.
> 2014-10-08 02:24:06,705 INFO [IPC Server handler 6 on 55291]
> org.apache.tez.dag.app.dag.impl.TaskImpl:
> TaskAttempt:attempt_1412734988643_0001_1_00_000000_0 sent events: (0-1)
> 2014-10-08 02:24:12,255 ERROR [AMRM Heartbeater thread]
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Exception
> on heartbeat
> java.io.IOException: Failed on local exception: java.io.EOFException; Host
> Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is:
> "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,256 INFO [AMRM Callback Handler Thread]
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted
> while waiting for queue
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
> 2014-10-08 02:24:12,257 ERROR [AMRM Callback Handler Thread]
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Stopping
> callback due to:
> java.io.IOException: Failed on local exception: java.io.EOFException; Host
> Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is:
> "m-tez-uns-try-3":8030;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy27.allocate(Unknown Source)
> at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy28.allocate(Unknown Source)
> at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,257 INFO [TaskSchedulerAppCaller #0]
> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Error reported by
> scheduler
> 2014-10-08 02:24:12,258 INFO [AsyncDispatcher event handler]
> org.apache.tez.common.TezUtilsInternal: Redirecting log file based on addend:
> dag_1412734988643_0001_1_post