[
https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150320#comment-14150320
]
Jeff Zhang commented on TEZ-1621:
---------------------------------
bq. This will break when we stop sending the failure notification to the AM
inline (thus blocking the System.exit() until the AM has been notified).
What does this mean ? sendFailure should not block the System.exit(), it would
either success (heartbeat success) or throw exception (heartbeat failure, this
would also cause TezChild shutdown).
But, agree on putting the System.exit into TezChild which would make code clean
and easy to maintain.
> Should report error to AM before shuting down TezChild
> ------------------------------------------------------
>
> Key: TEZ-1621
> URL: https://issues.apache.org/jira/browse/TEZ-1621
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Deepesh Khandelwal
> Assignee: Jeff Zhang
> Attachments: Tez-1621.patch, app_logs.txt, console.txt
>
>
> While running an in session testorderedwordcount example the DAG failed with
> the following error on the console:
> {noformat}
> 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics:
> [Vertex failed, vertexName=initialmap,
> vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed,
> taskId=task_1411586515507_0110_1_00_000000, diagnostics=[TaskAttempt 0
> failed, info=[Container container_1411586515507_0110_01_000002 finished with
> diagnostics set to [Container failed. Exception from container-launch.
> Container id: container_1411586515507_0110_01_000002
> Exit code: 255
> Stack trace: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This wasn't very helpful, the root cause is in the application log:
> {noformat}
> 2014-09-25 01:55:41,246 ERROR [TezChild]
> org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting
> now
> java.lang.UnsatisfiedLinkError:
> org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V
> at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native
> Method)
> at
> org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57)
> at
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291)
> at
> org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344)
> at
> org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444)
> at
> org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575)
> at
> org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539)
> at
> org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
> at
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
> at
> org.apache.hadoop.mapreduce.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.nextKeyValue(TezGroupedSplitsInputFormat.java:167)
> at
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116)
> at
> org.apache.tez.mapreduce.processor.map.MapProcessor$NewRecordReader.nextKeyValue(MapProcessor.java:266)
> at
> org.apache.tez.mapreduce.hadoop.mapreduce.MapContextImpl.nextKeyValue(MapContextImpl.java:81)
> at
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at
> org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:237)
> at
> org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:124)
> at
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
> at
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
> at
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:172)
> at
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:167)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2014-09-25 01:55:41,250 INFO [TezChild] org.apache.hadoop.util.ExitUtil:
> Exiting with status -1
> {noformat}
> Attached are the complete console.log and application log.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)