Task processes not exiting due to ackQueue bug in DFSClient
-----------------------------------------------------------
Key: HADOOP-5182
URL: https://issues.apache.org/jira/browse/HADOOP-5182
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.20.0, 0.21.0
Environment: EC2 with Ubuntu Linux 2.6.16 AMI. Hadoop trunk revision
734915
Reporter: Andy Konwinski
I was running gridmix tests on a 10-node EC2 cluster with unmodified Hadoop
trunk (SVN revision 734915). After running gridmix multiple times, I noticed
several MapReduce jobs stuck in the running state. They have remained hung for
several days now, while other gridmix runs of the same size finished in
approximately 8 hours.
I saw that the slave nodes had a number of hung task processes for tasks that
the JobTracker log said were complete. These were hanging because the SIGTERM
handler was waiting on DFSClient to close existing streams, but the close never
finished: DFSOutputStream waits on an ackQueue for acks from the datanodes, and
the acks apparently never arrived. The tasks did finish their work, but the
processes hung around.
I'll attach a sample jstack trace. Note how the SIGTERM handler is blocked on
"thread-5", which is waiting for the monitor on the DFSOutputStream, but that
stream's monitor is held by main, which is trying to flush the stream (last
trace in the file).
Has anyone seen this issue before?
Another thing I saw was cleanup tasks never running: they are stuck in the
initializing state on the web UI and can't be seen as running processes on the
nodes. I'm not sure whether that is actually related.