For our application we built service monitors that launch and supervise the Hadoop (and HBase) daemons as subprocesses. Each monitor publishes its service's location on a DHT that supports TTLs on values; we use a TTL of 30 seconds. Should the subprocess die, the monitor also exits, and the DHT values then expire. Redundant cluster monitor processes watch the DHT for service failure and can, via SSH commands, either restart a failed process or reassign a service away from a failed node. By policy, if the namenode fails, all of the datanodes are restarted after the reassignment/restart. Any service reliant on DFS that failed during the DFS outage is also restarted. The service monitors do service location discovery against the DHT and write hadoop-site.xml and hbase-site.xml files accordingly, so when dependent services restart they automatically pick up any location changes.
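To give a feel for the shape of it, here is a minimal sketch of the monitor loop in Java. It is an illustration, not our code: the Dht interface is a stand-in for whatever TTL-capable DHT client is in use, and the key name and 10-second refresh interval are made up for the example.

    import java.io.IOException;

    public class ServiceMonitor {

        // Stand-in for a real DHT client that supports TTLs on values.
        interface Dht {
            void put(String key, String value, int ttlSeconds);
        }

        // Launch the daemon as a subprocess, re-publish its location to the
        // DHT (30s TTL) while it lives, and exit when it dies so the DHT
        // entry is allowed to expire.
        public static void run(Dht dht, String serviceKey, String hostPort,
                               String... daemonCmd)
                throws IOException, InterruptedException {
            Process daemon = new ProcessBuilder(daemonCmd).inheritIO().start();
            while (daemon.isAlive()) {
                // Refresh well inside the 30s TTL so the entry never lapses
                // while the daemon is healthy.
                dht.put(serviceKey, hostPort, 30);
                Thread.sleep(10_000);
            }
            // Daemon died: stop refreshing and exit. The DHT value expires
            // within 30s, which is what the cluster monitors watch for.
            System.exit(daemon.exitValue());
        }
    }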
I suppose we could have done the above with ZooKeeper instead of a DHT (see the sketch after the quoted message below). I don't have any of our actual code that I can share, but the above took me less than a week to accomplish, so I can say it is not difficult.

Hope this helps,

- Andy

--- On Wed, 7/23/08, Pratyush Banerjee <[EMAIL PROTECTED]> wrote:

> From: Pratyush Banerjee <[EMAIL PROTECTED]>
> Subject: Automatic recovery Mechanism for namenode failure...
> To: [email protected]
> Date: Wednesday, July 23, 2008, 1:10 AM
>
> Hi All,
>
> We have been using hadoop 0.17.1 for a 50-machine cluster.
>
> Since we have continuous weblogs being written into the
> HDFS therein, we are concerned about failure of the
> namenode. Digging into the hadoop documentation, I found
> that currently hadoop does not support automatic recovery
> of the namenode.
> [...]
> However, for our situation we intend to have a mechanism
> that will detect a namenode failure and automatically
> start up the namenode with the -importcheckpoint option on
> the secondary namenode server.
> When I say automatically, it necessarily means absolutely
> no manual intervention at the point of failure and startup.
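P.S. Since I mentioned ZooKeeper above, here is a minimal sketch of what that variant might look like, assuming the stock ZooKeeper Java client. An ephemeral znode plays the role of the TTL'd DHT value (it disappears when the registering session dies), and a watch replaces polling for expiry. The paths and callback are illustrative only.

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class ZkServiceRegistry {

        // Register a service location as an ephemeral znode. When the
        // monitor process (and thus its ZooKeeper session) dies, the znode
        // vanishes, much like the DHT value expiring after its TTL.
        static void register(ZooKeeper zk, String path, String hostPort)
                throws Exception {
            zk.create(path, hostPort.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // A cluster monitor sets a watch instead of polling: it is notified
        // when the znode is deleted and can then trigger restart or
        // reassignment. (ZooKeeper watches are one-shot, so a real monitor
        // would re-register the watch after it fires.)
        static void watchForFailure(ZooKeeper zk, String path,
                                    Runnable onFailure) throws Exception {
            Stat stat = zk.exists(path, event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    onFailure.run();  // e.g. restart or reassign via SSH
                }
            });
            if (stat == null) onFailure.run();  // service already gone
        }
    }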
