[ 
https://issues.apache.org/jira/browse/HDFS-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated HDFS-15219:
-------------------------------
    Description: 
In my case, a Tez application was stuck for more than 2 hours until we killed 
the application. The cause was a single stuck task attempt, which stalled the 
whole application because speculative execution is disabled. 

The task log shows the following exception:
{code:java}
2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
read - 100000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]: 
records written - 1000000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
read - 1000000
2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block 
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
|yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for block 
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main] 
threw an Error. Shutting down now...
java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
 at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
 at java.lang.String.valueOf(String.java:2847)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
 at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 4 more
Caused by: java.util.zip.ZipException: error reading zip file
 at java.util.zip.ZipFile.read(Native Method)
 at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
 at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
 at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
 at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
 at sun.misc.Resource.getBytes(Resource.java:124)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 ... 10 more
2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block 
BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
|util.ExitUtil|: Exiting with status -1
2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
Received should die response from AM
2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked 
to die via task heartbeat
2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: 
Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an 
invocation of shutdownRequested

{code}
The cause is an uncaught Error. At 01:29 a disk failed, so reading the jar 
failed and a NoClassDefFoundError was thrown. ResponseProcessor.run only 
catches Exception, so it cannot catch NoClassDefFoundError, which is an Error. 
As a result, the ResponseProcessor never set errorState. The DataStreamer did 
not know that the ResponseProcessor was dead and could not trigger 
closeResponder, so it stayed stuck in DataStreamer.run.

I tested this with the unit test TestDataStream.testDfsClient. When I throw 
NoClassDefFoundError in ResponseProcessor.run, TestDataStream.testDfsClient 
fails because of a timeout.

I think we should catch Throwable instead of Exception in ResponseProcessor.run.
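
To illustrate the difference, here is a minimal standalone sketch. It is not the Hadoop code and not the actual patch: ResponderSketch, ErrorState, and runResponder are hypothetical names that only mimic a responder thread reporting failure through a shared flag. It shows that a catch (Exception e) block lets a NoClassDefFoundError escape without setting the flag, while catch (Throwable t) records it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ResponderSketch {

    /** Hypothetical stand-in for the error state shared with the streamer thread. */
    static final class ErrorState {
        private final AtomicBoolean error = new AtomicBoolean(false);
        void markError() { error.set(true); }
        boolean hasError() { return error.get(); }
    }

    /**
     * Runs ackLoop on a responder-like thread and reports whether the
     * shared error state was set by the time the thread died.
     */
    static boolean runResponder(Runnable ackLoop, boolean catchThrowable)
            throws InterruptedException {
        ErrorState state = new ErrorState();
        Thread responder = new Thread(() -> {
            if (catchThrowable) {
                try {
                    ackLoop.run();
                } catch (Throwable t) {
                    // Errors such as NoClassDefFoundError are recorded too.
                    state.markError();
                }
            } else {
                try {
                    ackLoop.run();
                } catch (Exception e) {
                    // Only reached for Exception; an Error bypasses this block.
                    state.markError();
                }
            }
        }, "ResponseProcessor-sketch");
        // Keep the demo quiet when the Error escapes the thread.
        responder.setUncaughtExceptionHandler((thread, ex) -> { });
        responder.start();
        responder.join();
        return state.hasError();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable failing = () -> {
            throw new NoClassDefFoundError("com/google/protobuf/TextFormat");
        };
        System.out.println("catch Exception: error state set = "
                + runResponder(failing, false));
        System.out.println("catch Throwable: error state set = "
                + runResponder(failing, true));
    }
}
```

In the first case the thread dies silently from the waiting side's point of view, which matches the hang described above. Whether the real fix should also rethrow the Error after setting the error state is a separate design question that this sketch does not address.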

 



> DFS Client will stuck when ResponseProcessor.run throw Error
> ------------------------------------------------------------
>
>                 Key: HDFS-15219
>                 URL: https://issues.apache.org/jira/browse/HDFS-15219
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.7.3
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 3.2.2
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>


