[
https://issues.apache.org/jira/browse/MAPREDUCE-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hitesh Shah updated MAPREDUCE-3058:
-----------------------------------
Attachment: MR-3058.wip.patch
The patch changes how containers are killed: it now sends a SIGTERM before a
delayed SIGKILL.
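For illustration, a minimal sketch of that kill sequence, assuming a
hypothetical sendSignal() helper and grace period; this is not the patch's
actual API, which presumably routes signals through the container executor:
{code}
// Illustrative sketch only, not the actual patch: send SIGTERM first so the
// task JVM can run its shutdown hooks, then escalate to SIGKILL after a
// grace period. DelayedKillSketch, GRACE_PERIOD_MS and sendSignal() are
// assumed names, not Hadoop identifiers.
public class DelayedKillSketch {
    private static final long GRACE_PERIOD_MS = 250;

    public static void kill(String pid) throws Exception {
        sendSignal(pid, "TERM");       // polite shutdown request
        Thread.sleep(GRACE_PERIOD_MS); // delayed escalation
        sendSignal(pid, "KILL");       // force-kill if still alive
    }

    // Delivers the signal via the external kill(1) command; a real
    // implementation would go through the container executor.
    private static void sendSignal(String pid, String signal) throws Exception {
        new ProcessBuilder("kill", "-s", signal, pid).start().waitFor();
    }
}
{code}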
Also tweaked the pid handling slightly so that the pid can still be looked up
after the command completes. This does not guarantee that we can actually get
the pid; it only means we can still reach the executor/process object, which
in turn may allow us to obtain the pid.
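To illustrate that second point, a hedged sketch of retaining the process
handle past command completion; the class and method names here are made up
for illustration:
{code}
// Illustrative only: keep the Process object around after the launch
// command completes, since it is the handle through which a pid may
// still be obtainable. RetainedLaunch is an assumed name.
import java.io.IOException;

public class RetainedLaunch {
    private final Process process; // retained past command completion

    public RetainedLaunch(ProcessBuilder pb) throws IOException {
        this.process = pb.start();
    }

    // Having the handle does not guarantee a usable pid, but it is the
    // object through which one can be obtained (directly via Process.pid()
    // on Java 9+, by other means on older JVMs).
    public long pid() {
        return process.pid();
    }
}
{code}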
Requires some testing to see if this addresses the underlying issues.
@Karam, could you give it a try?
> Sometimes task keeps on running while its Syslog says that it is shutdown
> -------------------------------------------------------------------------
>
> Key: MAPREDUCE-3058
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3058
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: contrib/gridmix, mrv2
> Affects Versions: 0.23.0
> Reporter: Karam Singh
> Assignee: Hitesh Shah
> Priority: Critical
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-3058-20110923.txt, MR-3058.wip.patch
>
>
> While running GridMixV3, one of the jobs was stuck for 15 hours. The
> job page showed one of its reducers stuck, and the stuck reducer's syslog
> contained the following:
> Task-logs' head:
> {code}
> 2011-09-19 17:57:22,002 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period
> at 10 second(s).
> 2011-09-19 17:57:22,002 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system
> started
> {code}
> Task-logs' tail:
> {code}
> 2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream java.io.IOException: Bad connect ack with
> firstBadLink as <DATANODE1>
> 2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> for block
> BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in
> pipeline <DATANODE2>, <DATANODE1>: bad datanode <DATANODE1>
> 2011-09-19 18:06:49,818 DEBUG
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol:
> lastAckedSeqno = 26870
> 2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <NAMENODE> from gridperf sending #454
> 2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <NAMENODE> from gridperf got value #454
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call:
> getAdditionalDatanode 8
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to
> datanode <DATANODE2>
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size
> 131071
> 2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
> 2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception
> running child : java.io.EOFException: Premature EOF: no length prefix
> available
> at
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
> 2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
> 2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value
> #455
> 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
> 2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup
> for the task
> 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <NAMENODE> from gridperf sending #456
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <NAMENODE> from gridperf got value #456
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
> 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value
> #457
> 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call:
> reportDiagnosticInfo 1
> 2011-09-19 18:06:49,859 DEBUG
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
> 2011-09-19 18:06:49,859 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask
> metrics system...
> 2011-09-19 18:06:49,859 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source
> UgiMetrics
> 2011-09-19 18:06:49,859 DEBUG
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class
> org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans:
> Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
> 2011-09-19 18:06:49,860 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source
> JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class
> org.apache.hadoop.metrics2.source.JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans:
> Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans:
> Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
> 2011-09-19 18:06:49,860 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system
> stopped.
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans:
> Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
> 2011-09-19 18:06:49,860 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system
> shutdown complete.
> {code}
> This means the task is supposed to have stopped within 20 seconds, whereas
> the process itself stayed stuck for more than 15 hours. The AM log also
> showed this task sending its status updates regularly, and ps -ef | grep
> java confirmed the process was still alive.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira