[ 
https://issues.apache.org/jira/browse/TEZ-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165778#comment-14165778
 ] 

Bikas Saha commented on TEZ-1643:
---------------------------------

[~hitesh] suggests that we could do the simple thing of shutting down the AM
when this happens. Hopefully the RM (when it comes back up) will restart the
AM. If this were the last AM attempt, that would be bad because the job would
fail. This corner case hurts most when we no longer needed the RM because all
our allocations were already complete. Ideally the AMRMClient should not be
erroring out in the first place.
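
To make the trade-off concrete, here is a minimal sketch of what the simple "shut down the AM" approach would lead to in each case. It is not Tez or YARN code: the class, the enum and the parameter names are all hypothetical, and it assumes the RM recovers and restarts the AM whenever attempts remain.

```java
// Hypothetical sketch: none of these names come from Tez or YARN.
final class AmShutdownOutcome {

    /** Possible results of shutting the AM down after an RM heartbeat error. */
    enum Outcome {
        RESTARTED_BY_RM,      // attempts remain; assumes the RM comes back and relaunches the AM
        JOB_FAILS,            // last attempt, and the AM still needed the RM for allocations
        JOB_FAILS_AVOIDABLY   // last attempt, but all allocations were already complete
    }

    /**
     * @param attempt         1-based number of the current AM attempt
     * @param maxAttempts     maximum number of AM attempts allowed
     * @param pendingRequests container requests the AM is still waiting on
     */
    static Outcome afterShutdown(int attempt, int maxAttempts, int pendingRequests) {
        if (attempt < 1 || maxAttempts < attempt || pendingRequests < 0) {
            throw new IllegalArgumentException("invalid attempt or request count");
        }
        if (attempt < maxAttempts) {
            return Outcome.RESTARTED_BY_RM;
        }
        // Last attempt: shutting down fails the job either way. With no pending
        // requests the AM could have finished without the RM, which is the corner case.
        return pendingRequests == 0 ? Outcome.JOB_FAILS_AVOIDABLY : Outcome.JOB_FAILS;
    }

    public static void main(String[] args) {
        System.out.println(afterShutdown(1, 2, 3));
        System.out.println(afterShutdown(2, 2, 3));
        System.out.println(afterShutdown(2, 2, 0));
    }
}
```

Only the third case is the one where the simple approach costs us something we could have avoided, which is why fixing the AMRMClient error itself would be preferable.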

> DAGAppMaster kills DAG & shuts down, when RM is restarted
> ---------------------------------------------------------
>
>                 Key: TEZ-1643
>                 URL: https://issues.apache.org/jira/browse/TEZ-1643
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Critical
>
> Scenario:
> 1. Start a long running job
> 2. Kill RM (recovery is enabled in RM. No RM-HA configured)
> 3. AMRMClientAsyncImpl$HeartbeatThread throws error (EOFException) which 
> internally causes the appmaster to kill DAG.
> 2014-10-08 02:24:06,705 INFO [IPC Server handler 6 on 55291] org.apache.tez.dag.app.dag.impl.TaskImpl: TaskAttempt:attempt_1412734988643_0001_1_00_000000_0 sent events: (0-1)
> 2014-10-08 02:24:12,255 ERROR [AMRM Heartbeater thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Exception on heartbeat
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy27.allocate(Unknown Source)
>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy28.allocate(Unknown Source)
>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,256 INFO [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
> 2014-10-08 02:24:12,257 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Stopping callback due to:
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "m-tez-uns-try-3/1.1.1.1"; destination host is: "m-tez-uns-try-3":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy27.allocate(Unknown Source)
>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy28.allocate(Unknown Source)
>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> 2014-10-08 02:24:12,257 INFO [TaskSchedulerAppCaller #0] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Error reported by scheduler
> 2014-10-08 02:24:12,258 INFO [AsyncDispatcher event handler] org.apache.tez.common.TezUtilsInternal: Redirecting log file based on addend: dag_1412734988643_0001_1_post



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
