[ https://issues.apache.org/jira/browse/TEZ-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578764#comment-14578764 ]
Jeff Zhang commented on TEZ-2534:
---------------------------------
[~hitesh] Please help review it.
> Error handling summary event when shutting down AM
> --------------------------------------------------
>
> Key: TEZ-2534
> URL: https://issues.apache.org/jira/browse/TEZ-2534
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2534-1.patch
>
>
> When the AM is shutting down, it closes the summary stream, but there may
> still be events in the queue. Handling one of these summary events after the
> stream is closed throws an exception (ClosedChannelException in the log
> below), which causes the next AM attempt to fail to recover the running DAG.
> One way to resolve this issue is to always drain the events in the queue
> before closing the summary stream (set drainEventsFlag to true), but this
> flag may be useful in unit tests.
> {noformat}
> 2015-06-03 16:37:15,761 INFO [Thread-1] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
> 2015-06-03 16:37:15,761 INFO [Thread-1] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
> 2015-06-03 16:37:15,761 INFO [Thread-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
> 2015-06-03 16:37:15,762 INFO [Thread-1] history.HistoryEventHandler: Stopping HistoryEventHandler
> 2015-06-03 16:37:15,762 INFO [Thread-1] recovery.RecoveryService: Stopping RecoveryService
> 2015-06-03 16:37:15,762 INFO [Thread-1] recovery.RecoveryService: Closing Summary Stream
> 2015-06-03 16:37:15,772 INFO [Thread-1] recovery.RecoveryService: Closing Output Stream for DAG dag_1433320263267_0019_1
> 2015-06-03 16:37:15,773 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Error handling summary event, eventType=VERTEX_FINISHED
> java.nio.channels.ClosedChannelException
> at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1622)
> at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
> at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
> at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
> at org.apache.tez.dag.history.events.VertexFinishedEvent.toSummaryProtoStream(VertexFinishedEvent.java:207)
> at org.apache.tez.dag.history.recovery.RecoveryService.handleSummaryEvent(RecoveryService.java:373)
> at org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:285)
> at org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:105)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.logJobHistoryVertexCompletedHelper(VertexImpl.java:1890)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.logJobHistoryVertexFinishedEvent(VertexImpl.java:1869)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.finished(VertexImpl.java:2107)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.finished(VertexImpl.java:2125)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.checkTasksForCompletion(VertexImpl.java:1989)
> at org.apache.tez.dag.app.dag.impl.VertexImpl$TaskCompletedTransition.transition(VertexImpl.java:3833)
> at org.apache.tez.dag.app.dag.impl.VertexImpl$TaskCompletedTransition.transition(VertexImpl.java:1)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1799)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
> at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1954)
> at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1)
> at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
> at java.lang.Thread.run(Thread.java:745)
> 2015-06-03 16:37:15,775 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Adding a flag to ensure next AM attempt does not start up, flagFile=hdfs://localhost:58857/tmp/owc-staging-dir/.tez/application_1433320263267_0019/recovery/1/RecoveryFatalErrorOccurred
> 2015-06-03 16:37:15,781 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Recovery failure occurred. Skipping all events
> 2015-06-03 16:37:15,781 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event VERTEX_FINISHED to history file
> {noformat}
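The drain-before-close idea from the description can be sketched roughly as below. This is only a simplified illustration and not the attached patch or the real RecoveryService: the SummaryWriter class, its single in-memory queue, and the String events are all hypothetical stand-ins. Only the drainEvents parameter mirrors the drainEventsFlag named in the description. Rejecting events after stop() is an extra guard assumed here; the issue itself does not mention it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical stand-in for a recovery writer; NOT the Tez RecoveryService. */
class SummaryWriter {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final OutputStream summaryStream;
    private boolean stopped = false;

    SummaryWriter(OutputStream summaryStream) {
        this.summaryStream = summaryStream;
    }

    /** Queues an event; returns false once the writer has been stopped (assumed guard). */
    synchronized boolean enqueue(String event) {
        if (stopped) {
            return false;
        }
        pending.add(event);
        return true;
    }

    /**
     * Stops the writer. With drainEvents set, every queued event is written
     * before the stream is closed, so no write can hit a closed stream.
     * Returns the number of events written during the drain.
     */
    synchronized int stop(boolean drainEvents) throws IOException {
        if (stopped) {
            return 0;
        }
        stopped = true;
        int written = 0;
        if (drainEvents) {
            String event;
            while ((event = pending.poll()) != null) {
                summaryStream.write((event + "\n").getBytes(StandardCharsets.UTF_8));
                written++;
            }
            summaryStream.flush();
        }
        summaryStream.close();
        return written;
    }
}

class DrainBeforeCloseDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SummaryWriter writer = new SummaryWriter(out);
        writer.enqueue("VERTEX_FINISHED");
        writer.enqueue("DAG_FINISHED");
        int written = writer.stop(true);
        System.out.println("events drained before close: " + written);
        System.out.println("accepted after stop: " + writer.enqueue("LATE_EVENT"));
    }
}
```

In this sketch, calling stop(false) closes the stream and leaves queued events unwritten, which is the case where a unit test might still want the flag to be configurable.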