[ 
https://issues.apache.org/jira/browse/TEZ-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579012#comment-14579012
 ] 

Hitesh Shah commented on TEZ-2534:
----------------------------------

+1

> Error handling summary event when shutting down AM
> --------------------------------------------------
>
>                 Key: TEZ-2534
>                 URL: https://issues.apache.org/jira/browse/TEZ-2534
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2534-1.patch
>
>
> When AM is shutting down, it will close the summary stream, but there may be 
> still some events in the queue which will cause exception when handling 
> summary event. And this would cause the next AM fail to recover the running 
> dag. 
> One way to resolve this issue is to always drain the events in the queue 
> before closing the summary stream (set drainEventsFlag as true), but this 
> flag may be useful in unit test. 
> {noformat}
> 2015-06-03 16:37:15,761 INFO [Thread-1] app.DAGAppMaster: 
> DAGAppMasterShutdownHook invoked
> 2015-06-03 16:37:15,761 INFO [Thread-1] app.DAGAppMaster: DAGAppMaster 
> received a signal. Signaling TaskScheduler
> 2015-06-03 16:37:15,761 INFO [Thread-1] rm.TaskSchedulerEventHandler: 
> TaskScheduler notified that iSignalled was : true
> 2015-06-03 16:37:15,762 INFO [Thread-1] history.HistoryEventHandler: Stopping 
> HistoryEventHandler
> 2015-06-03 16:37:15,762 INFO [Thread-1] recovery.RecoveryService: Stopping 
> RecoveryService
> 2015-06-03 16:37:15,762 INFO [Thread-1] recovery.RecoveryService: Closing 
> Summary Stream
> 2015-06-03 16:37:15,772 INFO [Thread-1] recovery.RecoveryService: Closing 
> Output Stream for DAG dag_1433320263267_0019_1
> 2015-06-03 16:37:15,773 ERROR [Dispatcher thread: Central] 
> recovery.RecoveryService: Error handling summary event, 
> eventType=VERTEX_FINISHED
> java.nio.channels.ClosedChannelException
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1622)
>       at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>       at java.io.DataOutputStream.write(DataOutputStream.java:107)
>       at 
> com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
>       at 
> com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
>       at 
> com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
>       at 
> org.apache.tez.dag.history.events.VertexFinishedEvent.toSummaryProtoStream(VertexFinishedEvent.java:207)
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService.handleSummaryEvent(RecoveryService.java:373)
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:285)
>       at 
> org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:105)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl.logJobHistoryVertexCompletedHelper(VertexImpl.java:1890)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl.logJobHistoryVertexFinishedEvent(VertexImpl.java:1869)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl.finished(VertexImpl.java:2107)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl.finished(VertexImpl.java:2125)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl.checkTasksForCompletion(VertexImpl.java:1989)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl$TaskCompletedTransition.transition(VertexImpl.java:3833)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl$TaskCompletedTransition.transition(VertexImpl.java:1)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>       at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>       at 
> org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1799)
>       at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
>       at 
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1954)
>       at 
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1)
>       at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>       at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-06-03 16:37:15,775 ERROR [Dispatcher thread: Central] 
> recovery.RecoveryService: Adding a flag to ensure next AM attempt does not 
> start up, 
> flagFile=hdfs://localhost:58857/tmp/owc-staging-dir/.tez/application_1433320263267_0019/recovery/1/RecoveryFatalErrorOccurred
> 2015-06-03 16:37:15,781 ERROR [Dispatcher thread: Central] 
> recovery.RecoveryService: Recovery failure occurred. Skipping all events
> 2015-06-03 16:37:15,781 INFO [HistoryEventHandlingThread] 
> impl.SimpleHistoryLoggingService: Writing event VERTEX_FINISHED to history 
> file
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to