What are people doing?

For jobs with a long enough SLA, simply shutting down the cluster and bringing up the secondary as the master works for us. For some of our jobs that approach doesn't work well, because the recovery time is not acceptable.

There has been internal discussion of using DRBD to hot-fail the namenode over to a backup so that the running job can continue.
