[
https://issues.apache.org/jira/browse/HDFS-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ayush Saxena reassigned HDFS-15219:
-----------------------------------
Assignee: zhengchenyu
> DFS Client will stuck when ResponseProcessor.run throw Error
> ------------------------------------------------------------
>
> Key: HDFS-15219
> URL: https://issues.apache.org/jira/browse/HDFS-15219
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.7.3
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Major
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> In my case, a Tez application was stuck for more than 2 hours until we killed
> the application. The reason is that one task attempt was stuck and speculative
> execution was disabled.
> The log shows the following exception:
> {code:java}
> 2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
> read - 100000
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]:
> records written - 1000000
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records
> read - 1000000
> 2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
> |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for
> block
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main]
> threw an Error. Shutting down now...
> java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
> at java.lang.String.valueOf(String.java:2847)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
> Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 4 more
> Caused by: java.util.zip.ZipException: error reading zip file
> at java.util.zip.ZipFile.read(Native Method)
> at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
> at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
> at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
> at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> at sun.misc.Resource.getBytes(Resource.java:124)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> ... 10 more
> 2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073]
> |util.ExitUtil|: Exiting with status -1
> 2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|:
> Received should die response from AM
> 2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|:
> Asked to die via task heartbeat
> 2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|:
> Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an
> invocation of shutdownRequested
> {code}
> The cause is an uncaught Error. At 01:29 a disk failed, so class loading threw
> NoClassDefFoundError. ResponseProcessor.run only catches Exception, so it
> cannot catch NoClassDefFoundError, and the ResponseProcessor never set
> errorState. As a result, DataStreamer did not know that the ResponseProcessor
> was dead and could not trigger closeResponder, so it stayed stuck in
> DataStreamer.run.
> I reproduced this in the unit test TestDataStream.testDfsClient: when I throw
> NoClassDefFoundError in ResponseProcessor.run, TestDataStream.testDfsClient
> fails because of a timeout.
> I think we should catch Throwable rather than Exception in
> ResponseProcessor.run.
>
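The difference between catching Exception and catching Throwable described in the report can be illustrated with a minimal, standalone sketch. It does not use any HDFS code: the class ResponderCatchSketch, the method responderSetsErrorFlag, and the hasError flag (a stand-in for errorState) are hypothetical names chosen for illustration, and the worker thread only imitates a responder that hits NoClassDefFoundError.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ResponderCatchSketch {

    /**
     * Runs a stand-in "responder" thread whose body throws NoClassDefFoundError
     * and reports whether the shared error flag was set before the thread died.
     * hasError is a hypothetical stand-in for the errorState in the report.
     */
    public static boolean responderSetsErrorFlag(boolean catchThrowable)
            throws InterruptedException {
        AtomicBoolean hasError = new AtomicBoolean(false);

        // Simulated failure: an Error, not an Exception.
        Runnable body = () -> {
            throw new NoClassDefFoundError("com/google/protobuf/TextFormat");
        };

        Thread responder = new Thread(() -> {
            if (catchThrowable) {
                try {
                    body.run();
                } catch (Throwable t) {
                    // Errors are Throwables, so the flag is set here.
                    hasError.set(true);
                }
            } else {
                try {
                    body.run();
                } catch (Exception e) {
                    // Never reached for NoClassDefFoundError: it is not an Exception.
                    hasError.set(true);
                }
            }
        });
        // Swallow the escaping Error quietly so the sketch prints only its own output.
        responder.setUncaughtExceptionHandler((thread, error) -> { });
        responder.start();
        responder.join();
        return hasError.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("catch (Exception): flag set = " + responderSetsErrorFlag(false));
        System.out.println("catch (Throwable): flag set = " + responderSetsErrorFlag(true));
    }
}
```

With catch (Exception) the thread dies without setting the flag, so any code that waits on that flag would never learn that the responder is gone. With catch (Throwable) the flag is set before the thread exits.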