NameNode failover procedure

Andrzej Bialecki Fri, 20 Jul 2007 09:47:28 -0700

Hi all,

I've been reading the docs and the code, but I'm still somewhat hazy as to what is the exact step-by-step procedure to perform a failover between a primary NameNode and a SecondaryNameNode, in case the former explodes or catches fire.

So far I've learned that the secondary namenode periodically refreshes its backup copies of the fsimage and editlog files. If the primary namenode disappears, it's the responsibility of the cluster admin to notice this, shut down the cluster, switch the configs across the cluster to point to the secondary namenode, start a primary namenode on the secondary namenode's host, and restart the rest of the daemons.
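
To make the sequence concrete, here is a minimal Python sketch of the steps as I currently understand them. It only builds and prints an ordered checklist; the host names (nn1, nn2, dn1, dn2) and the names Step and failover_plan are made up for illustration, the actions are plain descriptions rather than real Hadoop commands, and the idea that the new namenode starts from the secondary's backup copies is my assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    order: int
    target: str  # host(s) the step applies to, or "admin"
    action: str  # human-readable description, NOT a real Hadoop command


def failover_plan(primary: str, secondary: str, workers: List[str]) -> List[Step]:
    """Return the manual failover steps in order (hypothetical helper)."""
    everyone = ", ".join([secondary] + workers)
    actions = [
        ("admin", f"notice that the primary namenode on {primary} is gone"),
        (everyone, "shut down the remaining daemons"),
        (everyone, f"switch the configs to point to {secondary} as the namenode"),
        # Assumption: the new namenode uses the secondary's backup copies
        # of fsimage and editlog.
        (secondary, "start a primary namenode on this host"),
        (", ".join(workers), "restart the rest of the daemons"),
    ]
    return [Step(i, target, action)
            for i, (target, action) in enumerate(actions, start=1)]


if __name__ == "__main__":
    for step in failover_plan("nn1", "nn2", ["dn1", "dn2"]):
        print(f"{step.order}. [{step.target}] {step.action}")
```

The order is the point here: the configs are switched while everything is stopped, and the new namenode comes up before the other daemons are restarted.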

In the meantime the other admin person frantically tries to restore the primary namenode machine, and when it's ready we apply the process in reverse, or we make it into a secondary namenode.

Any comments, clarifications and/or automation of this procedure are welcome. ;)


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
