Alexey Serbin created KUDU-3530:
-----------------------------------
Summary: Add guardrails to prevent inconsistencies on attempts to
add multiple Kudu masters at once in a cluster
Key: KUDU-3530
URL: https://issues.apache.org/jira/browse/KUDU-3530
Project: Kudu
Issue Type: Improvement
Components: master
Reporter: Alexey Serbin
There have been a few reports of inconsistencies in the system catalog tablet's
Raft configuration after trying to add multiple new masters to a Kudu cluster
at once. It seems the current implementation of the {{AddMaster}} RPC isn't
thread-safe: the Raft configuration of the system catalog tablet became
corrupted after an attempt to add multiple extra masters at once (i.e. starting
several of the to-be-added masters simultaneously). The original Kudu master
reported an error like the one below upon its next restart:
{noformat}
Invalid argument: RunMasterServer() failed: Unable to initialize catalog
manager: Failed to initialize sys tables async: on-disk master list (:0) and
provided master list (m1.my.org:7051, m2.my.org:7051, m3.my.org:7051) differ by
more than one address. Their symmetric difference is: :0, m1.my.org:7051,
m2.my.org:7051, m3.my.org:7051
{noformat}
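To illustrate why the message above is triggered, here is a minimal sketch of the comparison it describes. It is not Kudu's actual code: the function name {{master_list_diff}} and the "more than one address" threshold expressed as {{len(diff) > 1}} are assumptions inferred from the wording of the error.

```python
def master_list_diff(on_disk, provided):
    """Return the sorted symmetric difference of two master address lists.

    Hypothetical helper, written only to mirror the wording of the error.
    """
    return sorted(set(on_disk) ^ set(provided))


# Values taken from the error message above.
on_disk = [":0"]
provided = ["m1.my.org:7051", "m2.my.org:7051", "m3.my.org:7051"]

diff = master_list_diff(on_disk, provided)

# Adding or removing a single master changes the list by one address.
# Assumption: anything larger is treated as an inconsistency.
too_different = len(diff) > 1
```

With the corrupted on-disk list ({{:0}}), all four addresses end up in the difference, so the check fails regardless of which master list is supplied at startup.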
It would be great to have guardrails preventing such corruption.
Essentially, we should enforce the one-new-master-at-a-time invariant, which the
current implementation implicitly assumes but doesn't consistently enforce.
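One possible shape for such a guardrail is sketched below, purely as an illustration. It is not taken from the Kudu code base: the class {{MasterConfigGuard}}, its methods, and the exception {{ConfigChangeInProgress}} are all hypothetical names. The idea is to perform the "is another addition pending?" check and the "mark this addition as pending" step under a single lock, so two concurrent requests cannot both pass.

```python
import threading


class ConfigChangeInProgress(Exception):
    """Raised when another master addition is still pending (hypothetical)."""


class MasterConfigGuard:
    """Hypothetical guard that admits at most one pending master addition."""

    def __init__(self, masters):
        self._lock = threading.Lock()
        self._masters = list(masters)
        self._pending = None  # address of the master being added, if any

    def begin_add(self, addr):
        # Check and set under one lock so concurrent callers cannot both pass.
        with self._lock:
            if self._pending is not None:
                raise ConfigChangeInProgress(
                    f"cannot add {addr}: addition of {self._pending} is pending")
            if addr in self._masters:
                raise ValueError(f"{addr} is already a master")
            self._pending = addr

    def finish_add(self, committed):
        # Called once the configuration change is committed or abandoned.
        with self._lock:
            if self._pending is None:
                raise RuntimeError("no master addition is pending")
            if committed:
                self._masters.append(self._pending)
            self._pending = None

    def masters(self):
        with self._lock:
            return list(self._masters)


guard = MasterConfigGuard(["m1.my.org:7051"])
guard.begin_add("m2.my.org:7051")
try:
    guard.begin_add("m3.my.org:7051")  # second addition while one is pending
    second_rejected = False
except ConfigChangeInProgress:
    second_rejected = True
guard.finish_add(committed=True)
```

In this sketch the second request is rejected outright rather than queued, which keeps the failure visible to the operator. Whether to reject or serialize such requests is a design choice the issue leaves open.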