Hi all,

I've been reading the docs and the code, but I'm still somewhat hazy on the exact step-by-step procedure for failing over from a primary NameNode to a SecondaryNameNode, in case the former explodes or catches fire.

So far I've learned that the secondary namenode periodically refreshes its backup copies of the fsimage and editlog files. If the primary namenode disappears, it's the responsibility of the cluster admin to notice this, shut down the cluster, switch the configs across the cluster to point to the secondary namenode, start a primary namenode on the secondary namenode's host, and restart the rest of the daemons.
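
To check that I have the order right, here is the procedure above written down as a rough Python sketch. It only builds a checklist of steps and runs nothing; the Cluster class, its field names and the failover_plan function are made up for illustration and are not part of Hadoop.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Hypothetical cluster inventory; not a Hadoop class."""
    primary_host: str        # host of the failed primary namenode
    secondary_host: str      # host running the secondary namenode
    other_hosts: List[str]   # every other node in the cluster


def failover_plan(cluster: Cluster) -> List[str]:
    """Return the manual failover steps, in order, as plain strings."""
    steps = [
        f"confirm that the namenode on {cluster.primary_host} is really gone",
        "shut down the remaining daemons across the cluster",
    ]
    # Every node's config has to point at the new namenode host.
    for host in [cluster.secondary_host] + cluster.other_hosts:
        steps.append(f"on {host}: point the config at {cluster.secondary_host}")
    steps.append(
        f"on {cluster.secondary_host}: start a primary namenode "
        "from the backup fsimage and editlog"
    )
    steps.append("restart the rest of the daemons")
    return steps


if __name__ == "__main__":
    # Example host names are placeholders.
    plan = failover_plan(Cluster("nn1", "nn2", ["dn1", "dn2"]))
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

If this ordering is wrong, or a step is missing, that is exactly the kind of correction I'm hoping for.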

In the meantime, the other admin frantically tries to restore the primary namenode machine, and when it's ready we either apply the process in reverse or turn that machine into the new secondary namenode.

Any comments, clarifications and/or automation of this procedure are welcome. ;)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
