[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3058:
-----------------------------------------------

    Attachment: MAPREDUCE-3058-20111021.txt

My earlier suspicion that a non-daemon thread created by user's code hanging 
the JVM also turned wrong. In that case, eventually the child heads for 
shutdown, kills the umbilical client which in turns makes Task's reporter 
thread to sys-exit with 65 exit-code after about 10 seconds.

Something else happened in this case - may be the DFSClient hung and caused the 
JVM to get stuck.

Attaching patch that should fix such cases. For non-daemon hanging threads, 
this will avoid the 10 seconds delay which I confirmed with the test.
                
> Sometimes task keeps on running while its Syslog says that it is shutdown
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3058
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3058
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/gridmix, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Karam Singh
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3058-20110923.txt, 
> MAPREDUCE-3058-20111021.txt, MR-3058.wip.patch
>
>
> While running GridMixV3, one of the jobs got stuck for 15 hrs. After clicking 
> on the Job-page, found one of its reduces to be stuck. Looking at syslog of 
> the stuck reducer, found this:
> Task-logs' head:
> {code}
> 2011-09-19 17:57:22,002 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period 
> at 10 second(s).
> 2011-09-19 17:57:22,002 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system 
> started
> {code}
> Task-logs' tail:
> {code}
> 2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in 
> createBlockOutputStream java.io.IOException: Bad connect ack with 
> firstBadLink as <DATANODE1>
> 2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
> for block 
> BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in 
> pipeline  <DATANODE2>,  <DATANODE1>: bad datanode  <DATANODE1>
> 2011-09-19 18:06:49,818 DEBUG 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: 
> lastAckedSeqno = 26870
> 2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <NAMENODE> from gridperf sending #454
> 2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <<NAMENODE> from gridperf got value #454
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: 
> getAdditionalDatanode 8
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to 
> datanode <DATANODE2>
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size 
> 131071
> 2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer 
> Exception
> java.io.EOFException: Premature EOF: no length prefix available
>         at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
> 2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception 
> running child : java.io.EOFException: Premature EOF: no length prefix 
> available
>         at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
> 2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
> 2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value 
> #455
> 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
> 2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup 
> for the task
> 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <NAMENODE> from gridperf sending #456
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <NAMENODE> from gridperf got value #456
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
> 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value 
> #457
> 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: 
> reportDiagnosticInfo 1
> 2011-09-19 18:06:49,859 DEBUG 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
> 2011-09-19 18:06:49,859 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask 
> metrics system...
> 2011-09-19 18:06:49,859 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source 
> UgiMetrics
> 2011-09-19 18:06:49,859 DEBUG 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class 
> org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
> 2011-09-19 18:06:49,860 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source 
> JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class 
> org.apache.hadoop.metrics2.source.JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
> 2011-09-19 18:06:49,860 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system 
> stopped.
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
> 2011-09-19 18:06:49,860 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system 
> shutdown complete.
> {code}
> Which means that tasks is supposed to have stopped within 20 secs, whereas 
> the process itself is stuck for more than 15 hours. From AM log, also found 
> that this task was sending its update regularly. ps -ef | grep java was also 
> showing that process is still alive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to