[ https://issues.apache.org/jira/browse/HADOOP-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706780#action_12706780 ]

Xing Shi commented on HADOOP-5182:
----------------------------------

https://issues.apache.org/jira/browse/HADOOP-3998

Has this bug been fixed by HADOOP-3998?

> Task processes not exiting due to ackQueue bug in DFSClient
> -----------------------------------------------------------
>
>                 Key: HADOOP-5182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5182
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0, 0.21.0
>         Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision 734915
>            Reporter: Andy Konwinski
>         Attachments: jstack_ackqueue_bug
>
>
> I was running gridmix tests on a 10-node EC2 cluster with unmodified Hadoop 
> trunk (SVN revision 734915). After running gridmix multiple times, I noticed 
> several MapReduce jobs stuck in the running state. Other gridmix runs of that 
> size finished in approximately 8 hours, but these jobs have remained hung for 
> several days and are still in that state.
> I saw that the slave nodes had a number of hung task processes for tasks that 
> the JobTracker log said were completed. They were hanging because the SIGTERM 
> handler was waiting on DFSClient to close existing streams, but the close never 
> finished: DFSOutputStream was waiting on an ackQueue for acks from the 
> datanodes that apparently never arrived. The tasks did finish their work, but 
> the processes hung around.
> I've attached a sample jstack trace - note how the SIGTERM handler is blocked 
> on "thread-5", which is waiting for a monitor on the DFSOutputStream; that 
> monitor is held by main, which is trying to flush the stream (last trace in 
> the file).
> Has anyone seen this issue before?
> Another thing I saw was cleanup tasks never running (they are stuck in the 
> initializing state on the web UI and can't be seen as running processes on 
> the nodes). Not sure if that is actually related.
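
The hang described above amounts to a shutdown-time deadlock: the thread that 
holds the stream's monitor is blocked waiting for acks that never arrive, so the 
SIGTERM/shutdown path can never enter close(). Below is a minimal, self-contained 
Java sketch of that lock pattern. It is illustrative only - the class and method 
names are made up and this is not the actual DFSClient/DFSOutputStream code.

    import java.util.LinkedList;
    import java.util.Queue;

    public class AckQueueHangSketch {

        // Stand-in for DFSOutputStream: flush() holds the stream's monitor while
        // waiting on a separate ackQueue object; close() needs the stream's monitor.
        static class FakeOutputStream {
            private final Queue<Integer> ackQueue = new LinkedList<>();

            // Synchronized on the stream; while holding that monitor, waits on
            // the ackQueue for an ack that never comes.
            public synchronized void flush() throws InterruptedException {
                synchronized (ackQueue) {
                    ackQueue.add(1);          // one packet outstanding, never acked
                    while (!ackQueue.isEmpty()) {
                        ackQueue.wait();      // releases ackQueue's monitor, not the stream's
                    }
                }
            }

            // Also synchronized on the stream, so it blocks behind flush() forever.
            public synchronized void close() {
                System.out.println("stream closed");
            }
        }

        public static void main(String[] args) throws InterruptedException {
            FakeOutputStream out = new FakeOutputStream();

            // Stand-in for the task's SIGTERM handler that closes open streams on exit.
            Runtime.getRuntime().addShutdownHook(new Thread(out::close, "sigterm-handler"));

            // main takes the stream's monitor and waits forever for acks; on SIGTERM
            // the shutdown hook blocks trying to enter close(), and the JVM never exits.
            out.flush();
        }
    }

Running this and sending the process a SIGTERM (kill <pid>) reproduces the shape 
of the attached jstack trace: the shutdown-hook thread blocks on the monitor held 
by main, which is parked in flush() waiting on the ack queue, so the process hangs.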

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
