This is probably related to HADOOP-4795.
http://issues.apache.org/jira/browse/HADOOP-4795

We are testing the fix on 0.18 now; it should be committed soon.
Please let us know if it is something else.

Thanks,
--Konstantin

Karl Kleinpaste wrote:
We have a cluster of 21 nodes with a total capacity of about 55T, where
we have had a problem twice in the last couple of weeks on startup of
NameNode.  We are running 0.18.1.  DFS usage is currently just below
the halfway point, about 25T.

The symptom is that there is normal startup logging on NameNode's part,
where it self-analyzes its expected DFS content, reports #files known,
and begins to accept reports from slaves' DataNodes about blocks they
hold.  During this time, NameNode is in safe mode pending adequate block
discovery from slaves.  As the fraction of reported blocks rises,
eventually it hits the required 0.9990 threshold and announces that it
will leave safe mode in 30 seconds.
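
For readers unfamiliar with the threshold check described above, here is a
minimal sketch of the arithmetic, not NameNode's actual code. The function
name and parameters are hypothetical; only the 0.9990 ratio comes from the
report above.

```python
def safe_mode_status(reported_blocks, total_blocks, threshold=0.999):
    """Return (fraction, reached) for the reported-block ratio.

    Illustrative only: mirrors the 0.9990 threshold mentioned in the
    report, not the real NameNode implementation.
    """
    if total_blocks == 0:
        # Assumption: an empty namespace trivially satisfies the threshold.
        return 1.0, True
    fraction = reported_blocks / total_blocks
    return fraction, fraction >= threshold
```

With 10,000 expected blocks, 9,990 reported blocks meet the threshold and
9,989 do not; once it is met, the 30-second countdown described above begins.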

The problem occurs when, at the point of logging "0 seconds to leave
safe mode," NameNode hangs: It uses no more CPU; it logs nothing
further; it stops responding on its port 50070 web interface; "hadoop
fs" commands report no contact with NameNode; "netstat -atp" shows a
number of open connections on 9000 and 50070, indicating the connections
are being accepted, but NameNode never processes them.
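
The "accepted but never processed" state described above can be told apart
from a dead process with a small probe. This is a rough sketch under stated
assumptions: `probe_port` is a hypothetical helper, and the HTTP request it
sends only makes sense for the 50070 web interface, not for the RPC port
9000.

```python
import socket


def probe_port(host, port, payload=b"GET / HTTP/1.0\r\n\r\n", timeout=5.0):
    """Classify a TCP port as 'unreachable', 'silent', 'closed', or 'responding'."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return "unreachable"
    with conn:
        try:
            conn.sendall(payload)
            data = conn.recv(1)
        except socket.timeout:
            # The kernel accepted the connection, but nothing answered in time.
            return "silent"
        except OSError:
            # Connection reset or a similar error after connecting.
            return "unreachable"
        # An empty read means the peer closed without sending anything.
        return "responding" if data else "closed"
```

A result of "silent" from something like `probe_port("namenode-host", 50070)`
(host name hypothetical) would match the netstat observation: the connection
is established at the TCP level, but the process never reads from it.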

This has happened twice in the last 2 weeks and it has us fairly
concerned.  Both times, it has been adequate simply to start over again,
and NameNode successfully comes to life the 2nd time around.  Is anyone
else familiar with this sort of hang, and do you know of any solutions?

