Adar Dembo created KUDU-2913:
--------------------------------
Summary: Document how a freshly formatted master can rejoin its
multi-master deployment
Key: KUDU-2913
URL: https://issues.apache.org/jira/browse/KUDU-2913
Project: Kudu
Issue Type: Bug
Components: documentation
Affects Versions: 1.11.0
Reporter: Adar Dembo
Suppose you have three masters. One suffers a hardware failure. Normally you'd
follow the steps [outlined
here|https://kudu.apache.org/docs/administration.html#_recovering_from_a_dead_kudu_master_in_a_multi_master_deployment]
to safely replace the dead master. But suppose you forget, and instead, you
bring in a new machine with the same DNS name and IP address and start a fresh
master with the same configuration ({{\-\-fs_wal_dir}} and
{{\-\-master_addrs}}) as the dead master.
Now you're in a bind, because the new master has a new UUID which the two
remaining masters don't expect. It is unable to communicate with them, doesn't
join their consensus group, and thus the multi-master deployment remains
degraded.
The workflow to fix this is a variant of the recovery workflow:
# Stop the new master.
# Delete all of the data out of the new master's WAL and data directories.
# Run {{sudo -u kudu kudu fs format}} using the new master's WAL and data
directories.
# Run {{sudo -u kudu kudu pbc edit}} on the various FS instance files (one in the
WAL directory and one in each data directory) in the new master. Replace the
UUID created during the format operation with the old master's UUID, which is
expected by the two remaining masters. Note: {{kudu pbc edit}} expects the UUID
as a base64-encoded string; you'll need to base64-encode the old UUID before
splicing it in.
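# Run {{sudo -u kudu kudu local_replica copy_from_remote}} to copy the master
tablet from one of the good masters. (The new master is stopped at this point,
so the local copy tool applies rather than {{kudu remote_replica copy}}.)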
# Start the new master.
# Wait for it to load the master tablet and join consensus. You can probably
use ksck to determine when this has happened.
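
The base64 step in the workflow above is easy to get wrong by hand, so here is a minimal Python sketch of it. It assumes the instance file stores the UUID as its 32-character ASCII hex string, so that string (not 16 raw bytes) is what gets base64-encoded; check this against the value {{kudu pbc edit}} shows for the freshly formatted UUID before relying on it. The UUID value and the helper names {{encode_uuid}} and {{decode_uuid}} are hypothetical.

```python
import base64

# Hypothetical UUID of the dead master; substitute the real value that the
# two remaining masters expect.
OLD_MASTER_UUID = "4aab798a69e94fab8d77069edff28ce0"


def encode_uuid(uuid: str) -> str:
    """Base64-encode the UUID's ASCII text (assumed on-disk representation)."""
    return base64.b64encode(uuid.encode("ascii")).decode("ascii")


def decode_uuid(encoded: str) -> str:
    """Reverse of encode_uuid; useful for inspecting the existing value."""
    return base64.b64decode(encoded).decode("ascii")


if __name__ == "__main__":
    encoded = encode_uuid(OLD_MASTER_UUID)
    # Round-trip check before pasting the value into the instance files.
    assert decode_uuid(encoded) == OLD_MASTER_UUID
    print(encoded)
```

Running {{decode_uuid}} on the value currently in an instance file is a quick way to confirm the assumed encoding: it should yield the UUID that the format step generated.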