On Tue, May 8, 2012 at 7:46 AM, Nathaniel Cook <nathani...@qualtrics.com> wrote:
> We have been working with an HA hdfs cluster, testing several failover
> scenarios.  We have a small cluster of 4 machines spun up for testing.
> We run a namenode on two of the machines and hosted an nfs share on
> the third for the shared edits directory. The fourth machine is just a
> datanode. We configured the cluster for automatic failover using ZKFC.
> We can start and stop the namenodes with no problems, failover happens
> as expected. Then we tested breaking the shared edits directory. We
> stopped the nfs share and then reenabled it. This caused the loss of a
> few edits.

Really? What mount options are you using on your NFS mount?

The active NN should abort immediately if the shared edits dir
disappears. Do you have logs available from your NNs during this time?

> This had no effect, as expected, on the namenodes, and the
> cluster functioned normally.

On the contrary, I'd expect the NN to bail out on the next edit (since
it has no place to reliably fsync it).

> We stopped the standby namenode and tried
> to start it again, but it would not start because of the missing edits. No
> matter what we tried we could not rebuild the shared edits directory
> and thus get the second namenode back online. In this state the hdfs
> cluster continued to function but it was no longer an HA cluster. To
> get the cluster back in HA mode we had to reformat the namenode data
> with the shared edits. In this case how do you rebuild the shared
> edits data so you can get the cluster back to an HA mode?

It sounds like something went wrong with the facility that's supposed
to make the active NN crash if shared edits go away. The logs will
help.

To answer your question, though, you can run the
"initializeSharedEdits" process again to re-initialize that edits dir.
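
As a rough sketch, that re-initialization might look like the script below.
It assumes the command is run on the NameNode host whose local edits are
intact, with that NameNode process stopped and the NFS share mounted again.
The "run" wrapper, the DRY_RUN variable, and the function name are
hypothetical helpers for illustration only; the one real command is
"hdfs namenode -initializeSharedEdits".

```shell
#!/usr/bin/env bash
# Sketch only: re-seed the shared edits dir from this NameNode's local edits.
# DRY_RUN defaults to 1, so by default the script only prints the command.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Assumption: run on the NN with intact local metadata, while that NN is stopped.
reinit_shared_edits() {
  run hdfs namenode -initializeSharedEdits
}

reinit_shared_edits
```

Run as-is it only prints the command; set DRY_RUN=0 to actually execute it.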

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
