We have been working with an HA HDFS cluster, testing several failover
scenarios.  We have a small cluster of 4 machines spun up for testing.
We run a namenode on two of the machines and host an NFS share on
the third for the shared edits directory. The fourth machine is just a
datanode. We configured the cluster for automatic failover using ZKFC.
We can start and stop the namenodes with no problems, and failover happens
as expected. Then we tested breaking the shared edits directory: we
stopped the NFS share and then re-enabled it, which caused the loss of a
few edits. As expected, this had no effect on the namenodes, and the
cluster functioned normally. We then stopped the standby namenode and tried
to start it again, but it would not start because of the missing edits. No
matter what we tried, we could not rebuild the shared edits directory
and thus could not get the second namenode back online. In this state the
HDFS cluster continued to function, but it was no longer an HA cluster. To
get the cluster back in HA mode we had to reformat the namenode data
with the shared edits. In this case, how do you rebuild the shared
edits data so you can get the cluster back into HA mode?

-- 
-Nathaniel Cook
