[ https://issues.apache.org/jira/browse/HADOOP-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
dhruba borthakur resolved HADOOP-5182.
--------------------------------------
Resolution: Duplicate
I am closing this one because I think this is a duplicate of HADOOP-3998.
Please re-open if you think otherwise.
> Task processes not exiting due to ackQueue bug in DFSClient
> -----------------------------------------------------------
>
> Key: HADOOP-5182
> URL: https://issues.apache.org/jira/browse/HADOOP-5182
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.20.0, 0.21.0
> Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision
> 734915
> Reporter: Andy Konwinski
> Attachments: jstack_ackqueue_bug
>
>
> I was running gridmix tests on a 10-node cluster on EC2 and ran into an
> issue with unmodified Hadoop trunk (SVN revision 734915). After running
> gridmix multiple times, I noticed several MapReduce jobs stuck in the
> running state. They have remained hung for several days and are still in
> that state, while other gridmix runs of the same size finished in
> approximately 8 hours.
> I saw that the slave nodes had a number of hung task processes running, for
> tasks that the JobTracker log said were completed. These were hanging because
> the SIGTERM handler was waiting on DFSClient to close existing streams, but
> the close never completed because DFSOutputStream was waiting on an ackQueue
> that was apparently receiving no acks from the datanodes. The tasks did
> finish their work, but the processes never exited.
> I've attached a sample jstack trace. Note how the SIGTERM handler is blocked
> on "thread-5", which is waiting for a monitor on the DFSOutputStream, but
> this stream's monitor is held by main, which is trying to flush the stream
> (last trace in the file).
> Has anyone seen this issue before?
> Another thing I saw was cleanup tasks never running (they are stuck in the
> initializing state on the web UI and can't be seen as running processes on
> the nodes). Not sure if that is actually related.
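
The hang described in the report can be illustrated with a small standalone sketch. This is not the DFSClient code: the class and method names (AckQueueHangSketch, SketchStream, closerBlocked) are hypothetical, and the sketch assumes, based only on the description above, that the flush path holds the stream's monitor while waiting on a separate queue object for acks that never arrive. Under that assumption, any other thread calling a synchronized close() on the same stream stays blocked.

```java
import java.util.LinkedList;
import java.util.concurrent.CountDownLatch;

public class AckQueueHangSketch {

    /** Hypothetical stand-in for an output stream that waits for acks. */
    static class SketchStream {
        // Packets sent but not yet acknowledged. Nothing ever removes the
        // entry, which models datanodes that never send an ack.
        private final LinkedList<String> ackQueue = new LinkedList<>();
        // Signals that flush() now holds the stream's monitor.
        final CountDownLatch flushing = new CountDownLatch(1);

        SketchStream() {
            ackQueue.add("packet-0");
        }

        // Holds the stream's monitor for the whole call.
        synchronized void flush() throws InterruptedException {
            synchronized (ackQueue) {
                flushing.countDown();
                while (!ackQueue.isEmpty()) {
                    // wait() releases only ackQueue's monitor; the stream's
                    // monitor stays held by this thread.
                    ackQueue.wait();
                }
            }
        }

        // Needs the stream's monitor, so it blocks while flush() is waiting.
        synchronized void close() {
            // Nothing to do in the sketch.
        }
    }

    /** Returns true if close() was still blocked while flush() waited. */
    static boolean closerBlocked() {
        SketchStream stream = new SketchStream();
        Thread flusher = new Thread(() -> {
            try {
                stream.flush();
            } catch (InterruptedException e) {
                // Interrupted by the demo so that it can terminate.
            }
        }, "flusher");
        Thread closer = new Thread(stream::close, "closer");
        flusher.setDaemon(true);
        closer.setDaemon(true);
        try {
            flusher.start();
            stream.flushing.await();      // flusher now owns the stream monitor
            closer.start();
            closer.join(500);             // give close() a chance to finish
            boolean blocked = closer.isAlive();
            flusher.interrupt();          // break the hang so the demo exits
            closer.join(2000);
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("close() blocked while flush() waits for acks: "
                + closerBlocked());
    }
}
```

In the sketch the demo interrupts the flushing thread so that it can exit. In the reported scenario nothing interrupts main, so a thread that needs the stream's monitor would wait indefinitely, which would match the blocked SIGTERM handler in the jstack trace.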