[
https://issues.apache.org/jira/browse/HDFS-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhengchenyu updated HDFS-15219:
-------------------------------
Description:
In my case, a Tez application was stuck for more than 2 hours until we killed
it. The application hung because one task attempt was stuck and speculative
execution is disabled.
The log shows this exception:
{code:java}
2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
read - 100000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]:
records written - 1000000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
read - 1000000
2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
|yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for block
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main]
threw an Error. Shutting down now...
java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
at java.lang.String.valueOf(String.java:2847)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 4 more
Caused by: java.util.zip.ZipException: error reading zip file
at java.util.zip.ZipFile.read(Native Method)
at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at sun.misc.Resource.getBytes(Resource.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
... 10 more
2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
|util.ExitUtil|: Exiting with status -1
2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|:
Received should die response from AM
2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked
to die via task heartbeat
2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|:
Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an
invocation of shutdownRequested
{code}
The cause is an uncaught Error. At 01:29 a disk failed, so a class could not be
loaded and NoClassDefFoundError was thrown. ResponseProcessor.run only catches
Exception, so it cannot catch NoClassDefFoundError, and the ResponseProcessor
never set errorState. The DataStreamer therefore did not know that the
ResponseProcessor was dead and could not trigger closeResponder, so it stayed
stuck in DataStreamer.run.
I tested this in the unit test TestDataStream.testDfsClient. When I throw
NoClassDefFoundError, TestDataStream.testDfsClient fails because of a
timeout.
I think we should catch Throwable rather than Exception in
ResponseProcessor.run.
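The difference between the two catch clauses can be illustrated with a minimal,
self-contained sketch. This is not the HDFS code: Responder, processAck,
errorRecorded and the AtomicBoolean flag are hypothetical stand-ins for
ResponseProcessor and errorState, and the Error is thrown directly instead of
coming from a failed class load.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class CatchThrowableSketch {

  /** Hypothetical stand-in for the responder thread; not the HDFS class. */
  static final class Responder implements Runnable {
    private final AtomicBoolean errorState;
    private final boolean catchThrowable;

    Responder(AtomicBoolean errorState, boolean catchThrowable) {
      this.errorState = errorState;
      this.catchThrowable = catchThrowable;
    }

    @Override
    public void run() {
      if (catchThrowable) {
        try {
          processAck();
        } catch (Throwable t) {
          // Error is a subclass of Throwable, so the failure is recorded.
          errorState.set(true);
        }
      } else {
        try {
          processAck();
        } catch (Exception e) {
          // Never reached here: NoClassDefFoundError is an Error, not an Exception.
          errorState.set(true);
        }
      }
    }

    /** Simulates the failure from the report by throwing the Error directly. */
    private void processAck() {
      throw new NoClassDefFoundError("com/google/protobuf/TextFormat");
    }
  }

  /** Runs the responder to completion and reports whether it recorded an error. */
  static boolean errorRecorded(boolean catchThrowable) {
    AtomicBoolean errorState = new AtomicBoolean(false);
    Thread t = new Thread(new Responder(errorState, catchThrowable), "responder-sketch");
    // Swallow the uncaught Error so the demo does not print a stack trace.
    t.setUncaughtExceptionHandler((thread, err) -> { });
    t.start();
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for responder", e);
    }
    return errorState.get();
  }

  public static void main(String[] args) {
    System.out.println("catch Exception: errorState=" + errorRecorded(false));
    System.out.println("catch Throwable: errorState=" + errorRecorded(true));
  }
}
```

With the Exception-only catch the thread dies without setting the flag, so
anything waiting on that flag has no signal that the responder is gone. With
the Throwable catch the flag is set before the thread exits.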
was:
In my case, a Tez application was stuck for more than 2 hours until we killed
it. The application hung because one task attempt was stuck and speculative
execution is disabled.
The log shows this exception:
{code:java}
2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
read - 100000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]:
records written - 1000000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
read - 1000000
2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
|yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for block
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main]
threw an Error. Shutting down now...
java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
at java.lang.String.valueOf(String.java:2847)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 4 more
Caused by: java.util.zip.ZipException: error reading zip file
at java.util.zip.ZipFile.read(Native Method)
at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at sun.misc.Resource.getBytes(Resource.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
... 10 more
2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
|util.ExitUtil|: Exiting with status -1
2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|:
Received should die response from AM
2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked
to die via task heartbeat
2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|:
Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an
invocation of shutdownRequested
{code}
Reason is UncaughtException. ResponseProcessor.run
> DFS Client will stuck when ResponseProcessor.run throw Error
> ------------------------------------------------------------
>
> Key: HDFS-15219
> URL: https://issues.apache.org/jira/browse/HDFS-15219
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.7.3
> Reporter: zhengchenyu
> Priority: Major
> Fix For: 3.2.2
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> In my case, a Tez application was stuck for more than 2 hours until we killed
> it. The application hung because one task attempt was stuck and speculative
> execution is disabled.
> The log shows this exception:
> {code:java}
> 2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
> read - 100000
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]:
> records written - 1000000
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
> read - 1000000
> 2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
> |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for
> block
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main]
> threw an Error. Shutting down now...
> java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
> at java.lang.String.valueOf(String.java:2847)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
> Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 4 more
> Caused by: java.util.zip.ZipException: error reading zip file
> at java.util.zip.ZipFile.read(Native Method)
> at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
> at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
> at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
> at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> at sun.misc.Resource.getBytes(Resource.java:124)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> ... 10 more
> 2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
> |util.ExitUtil|: Exiting with status -1
> 2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|:
> Received should die response from AM
> 2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|:
> Asked to die via task heartbeat
> 2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|:
> Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an
> invocation of shutdownRequested
> {code}
> The cause is an uncaught Error. At 01:29 a disk failed, so a class could not
> be loaded and NoClassDefFoundError was thrown. ResponseProcessor.run only
> catches Exception, so it cannot catch NoClassDefFoundError, and the
> ResponseProcessor never set errorState. The DataStreamer therefore did not
> know that the ResponseProcessor was dead and could not trigger closeResponder,
> so it stayed stuck in DataStreamer.run.
> I tested this in the unit test TestDataStream.testDfsClient. When I throw
> NoClassDefFoundError, TestDataStream.testDfsClient fails because of a
> timeout.
> I think we should catch Throwable rather than Exception in
> ResponseProcessor.run.
>