[ https://issues.apache.org/jira/browse/HADOOP-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706780#action_12706780 ]
Xing Shi commented on HADOOP-5182:
----------------------------------

https://issues.apache.org/jira/browse/HADOOP-3998
Has this bug been fixed?

> Task processes not exiting due to ackQueue bug in DFSClient
> -----------------------------------------------------------
>
>                 Key: HADOOP-5182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5182
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0, 0.21.0
>         Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision 734915
>            Reporter: Andy Konwinski
>         Attachments: jstack_ackqueue_bug
>
>
> I was running some gridmix tests on a 10 node cluster on EC2 and ran into an
> issue with unmodified Hadoop trunk (SVN revision# 734915). After running
> gridmix multiple times, I noticed several mapreduce jobs stuck in the running
> state. They have remained in that hung state for several days and are still
> in it, while other gridmixes of that size finished in approximately 8 hours.
> I saw that the slave nodes had a bunch of hung task processes running, for
> tasks that the JobTracker log said were completed. These were hanging because
> the SIGTERM handler was waiting on DFSClient to close existing streams, but
> this was never finishing because DFSOutputStream waits on an ackQueue from
> the datanodes that was apparently getting no acks. The tasks did finish their
> work, but the processes hung around.
> I'll attach a sample jstack trace - note how the SIGTERM handler is blocked
> on "thread-5", which is waiting for a monitor on the DFSOutputStream, but
> this stream's monitor is held by main, which is trying to flush the stream
> (last trace in the file).
> Has anyone seen this issue before?
> Another thing I saw was cleanup tasks never running (they are stuck in the
> initializing state on the web UI and can't be seen as running processes on
> the nodes). Not sure if that is actually related.

-- 
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
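
The hang pattern described in the quoted report (one thread holding the stream's monitor while it waits for acks that never arrive, and a second thread blocked on that same monitor while trying to close the stream) can be illustrated with a minimal sketch. This is Python rather than Hadoop's Java, and it is not DFSClient code: `simulate_hang`, `stream_lock`, and `ack_received` are hypothetical names standing in for the roles named in the report.

```python
import threading


def simulate_hang(acks_arrive: bool, timeout: float = 0.5) -> bool:
    """Return True if the simulated close() finished within `timeout` seconds.

    Hypothetical illustration only; none of these names exist in Hadoop.
    """
    stream_lock = threading.Lock()     # stands in for the DFSOutputStream monitor
    ack_received = threading.Event()   # stands in for an ack from the datanodes
    holding = threading.Event()        # signals that the flusher owns the lock
    closed = threading.Event()         # set once the close path gets the lock

    def flusher():
        # Role of "main": takes the stream monitor, then waits for an ack
        # while still holding it.
        with stream_lock:
            holding.set()
            ack_received.wait()

    def closer():
        # Role of the thread the SIGTERM handler waits on: it needs the
        # same monitor before it can close the stream.
        holding.wait()
        with stream_lock:
            closed.set()

    threading.Thread(target=flusher, daemon=True).start()
    threading.Thread(target=closer, daemon=True).start()

    if acks_arrive:
        ack_received.set()

    finished = closed.wait(timeout)
    ack_received.set()  # release the sketch's threads so they do not linger
    return finished
```

Calling `simulate_hang(True)` lets the close path complete, while `simulate_hang(False)` times out because the lock holder never gets its ack. The timeout exists only so the sketch can report the hang instead of reproducing it.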