Adar Dembo created KUDU-2913:
--------------------------------

             Summary: Document how a freshly formatted master can rejoin its 
multi-master deployment
                 Key: KUDU-2913
                 URL: https://issues.apache.org/jira/browse/KUDU-2913
             Project: Kudu
          Issue Type: Bug
          Components: documentation
    Affects Versions: 1.11.0
            Reporter: Adar Dembo


Suppose you have three masters. One suffers a hardware failure. Normally you'd 
follow the steps [outlined 
here|https://kudu.apache.org/docs/administration.html#_recovering_from_a_dead_kudu_master_in_a_multi_master_deployment]
 to safely replace the dead master. But suppose you forget, and instead, you 
bring in a new machine with the same DNS name and IP address and start a fresh 
master with the same configuration ({{\-\-fs_wal_dir}} and 
{{\-\-master_addrs}}) as the dead master.

Now you're in a bind, because the new master has a new UUID which the two 
remaining masters don't expect. It is unable to communicate with them, doesn't 
join their consensus group, and thus the multi-master deployment remains 
degraded.

The workflow to fix this is a variant of the recovery workflow:
# Stop the new master.
# Delete all of the data out of the new master's WAL and data directories.
# Run {{sudo -u kudu kudu fs format}} using the new master's WAL and data 
directories.
# Run {{sudo -u kudu kudu pbc edit}} on the various FS instance files (one in 
the WAL directory and one in each data directory) of the new master. Replace 
the UUID created during the format operation with the old master's UUID, which 
is what the two remaining masters expect. Note: {{kudu pbc edit}} expects the 
UUID as a base64-encoded string; you'll need to base64-encode the old UUID 
before splicing it in.
# Run {{sudo -u kudu kudu remote_replica copy}} to copy the master tablet from 
one of the good masters.
# Start the new master.
# Wait for it to load the master tablet and join consensus; ksck can be used 
to verify when this has happened.
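The workflow above might be sketched as the following shell session. Every concrete value here (the dead master's UUID, the directory paths, the host names, and the service name) is a placeholder for illustration, not something taken from this issue, and the exact arguments of each {{kudu}} subcommand should be double-checked against its {{--help}} output for your version.

```shell
# Hypothetical values -- substitute your own.
OLD_UUID=2cbcc59f9d074b5a8a9a2e4b77b87c01     # UUID of the dead master (placeholder)
WAL_DIR=/var/lib/kudu/master/wal
DATA_DIR=/var/lib/kudu/master/data
LIVE_MASTER=master-2.example.com:7051         # one of the healthy masters

# Steps 1-2: stop the new master and wipe its WAL and data directories
# (the service name varies by packaging).
sudo systemctl stop kudu-master
sudo -u kudu rm -rf "$WAL_DIR"/* "$DATA_DIR"/*

# Step 3: re-format; this mints a fresh (and, for our purposes, wrong) UUID.
sudo -u kudu kudu fs format --fs_wal_dir="$WAL_DIR" --fs_data_dirs="$DATA_DIR"

# Step 4: pbc edit shows bytes fields base64-encoded, so encode the old
# UUID first, then splice it into each instance file in the editor.
printf '%s' "$OLD_UUID" | base64
sudo -u kudu kudu pbc edit "$WAL_DIR/instance"
sudo -u kudu kudu pbc edit "$DATA_DIR/instance"

# Step 5: copy the master tablet (its ID is the literal 32-zero string)
# from a healthy master; argument order per `kudu remote_replica copy --help`.
sudo -u kudu kudu remote_replica copy 00000000000000000000000000000000 \
    "$LIVE_MASTER" master-3.example.com:7051

# Steps 6-7: restart, then watch ksck until the master rejoins consensus.
sudo systemctl start kudu-master
sudo -u kudu kudu cluster ksck \
    master-1.example.com,master-2.example.com,master-3.example.com
```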




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
