Jeff Zhang created TEZ-2260:
-------------------------------
Summary: AM shut down due to NoSuchMethodError in DAGProtos
Key: TEZ-2260
URL: https://issues.apache.org/jira/browse/TEZ-2260
Project: Apache Tez
Issue Type: Bug
Reporter: Jeff Zhang
Not sure why this happens; it may be an environment issue (for example, mismatched jars on the AM classpath; see the diagnostic sketch after the log below).
{code}
2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]:
vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_000000_0,
startTime=1427850527981, finishTime=1427850529750, timeTaken=1769,
status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System
Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9,
HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6,
org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46,
COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread]
yarn.YarnUncaughtExceptionHandler: Thread
Thread[RecoveryEventHandlingThread,5,main] threw an Error. Shutting down now...
java.lang.NoSuchMethodError:
org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
at org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
at org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
at org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
at org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
at org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
at org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
at org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
at java.lang.Thread.run(Thread.java:745)
2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] impl.TaskAttemptImpl:
attempt_1427850436467_0007_1_00_000000_0 TaskAttempt Transitioned from RUNNING
to SUCCEEDED due to event TA_DONE
{code}
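A NoSuchMethodError on generated protobuf code like DAGProtos usually points to mismatched jars on the AM classpath (for example, a class generated against one protobuf-java version running against another) rather than a bug in the event handling itself, which fits the "environment issue" theory above. Below is a hypothetical diagnostic sketch (not part of Tez, and the class name ClasspathProbe is made up here) that prints which jar each suspect class is actually loaded from; running it with the same classpath the AM uses would confirm or rule out a duplicate/mismatched jar.
{code}
// Hypothetical diagnostic (not part of Tez): print the jar each suspect class is
// loaded from, to spot a Tez/protobuf version mismatch on the AM classpath.
public class ClasspathProbe {
  public static void main(String[] args) throws Exception {
    String[] suspects = {
        "org.apache.tez.dag.api.records.DAGProtos",
        "com.google.protobuf.AbstractMessageLite"
    };
    for (String name : suspects) {
      Class<?> clazz = Class.forName(name);
      java.security.CodeSource src = clazz.getProtectionDomain().getCodeSource();
      // A null CodeSource usually means the class came from the bootstrap loader.
      System.out.println(name + " -> "
          + (src != null ? src.getLocation() : "<bootstrap or unknown>"));
    }
  }
}
{code}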
This issue results in several follow-on problems. Because of this error the AM shuts down and a new attempt goes through recovery, but that next attempt then hits the following issue; it looks like a datanode crashed.
{code}
2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline
due to no more good datanodes being available to try. (Nodes:
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238,
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT,
and a client may configure this via
'dfs.client.block.write.replace-datanode-on-failure.policy' in its
configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: Error
while syncing
java.io.IOException: Failed to replace a bad datanode on the existing pipeline
due to no more good datanodes being available to try. (Nodes:
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238,
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT,
and a client may configure this via
'dfs.client.block.write.replace-datanode-on-failure.policy' in its
configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Central]
recovery.RecoveryService: Error handling summary event,
eventType=VERTEX_FINISHED
java.io.IOException: Failed to replace a bad datanode on the existing pipeline
due to no more good datanodes being available to try. (Nodes:
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238,
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT,
and a client may configure this via
'dfs.client.block.write.replace-datanode-on-failure.policy' in its
configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
{code}
Because of the above issue (the error while writing the summary event to the recovery log), the AM shuts down, and on the client side a SessionNotRunning exception is thrown without any diagnostic info.
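For reference, the "Failed to replace a bad datanode" error in the second attempt is what the HDFS client reports when there is no spare datanode to swap into the write pipeline, which is easy to hit on a small or local cluster with only two datanodes (as the 127.0.0.1 addresses above suggest). The sketch below is a minimal illustration of the client-side setting the log message itself points to, assuming a test-sized cluster where losing pipeline redundancy is acceptable; it only works around that symptom and does not address the underlying recovery bug.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RecoveryFsConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With only two datanodes there is no replacement candidate, so the DEFAULT
    // policy aborts the write. NEVER keeps writing on the surviving pipeline
    // instead; acceptable for test clusters, not for production durability.
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    // Alternatively, the replacement feature can be disabled entirely:
    // conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", false);
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Using " + fs.getUri() + " with relaxed pipeline recovery");
  }
}
{code}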