Will Berkeley created KUDU-2748:
-----------------------------------

             Summary: Leader master erroneously tries to tablet copy to a 
follower master due to race at startup
                 Key: KUDU-2748
                 URL: https://issues.apache.org/jira/browse/KUDU-2748
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.9.0
            Reporter: Will Berkeley


I was investigating KUDU-2734 and ran into a weird situation. The test runs 
with 3 masters and changes the value of a flag on the masters. To effect the 
change, it restarts the masters. Suppose the masters are labelled A, B, and C. 
Somewhat rarely (e.g. 8% of the time when run in TSAN with 8 stress threads), 
the following happens:

1. A and B are restarted successfully. They form a quorum and elect a leader 
(say A).
2. C is in the process of restarting. The ConsensusService is registered and C 
is accepting RPCs.
3. A sends C an UpdateConsensus RPC. However, C is still in the process of 
starting and has not yet initialized the systable, so it responds to the 
UpdateConsensus call with TABLET_NOT_FOUND, even though the proper response 
would be SERVICE_UNAVAILABLE.
4. A interprets TABLET_NOT_FOUND to mean that C needs to be copied to, and it 
tries forever to tablet copy to C. The copies never start because tablet copy 
is not implemented for masters.
5. C finishes its startup but never receives UpdateConsensus from A because A 
is sending StartTabletCopy requests instead. C calls pre-elections endlessly.
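
The sequence above can be sketched as a toy simulation. This is only an 
illustrative model, not Kudu code: the class names, the tick loop, and the 
`fixed` switch are all hypothetical, and it assumes that a leader receiving 
SERVICE_UNAVAILABLE would simply retry UpdateConsensus later.

```python
from enum import Enum, auto


class Resp(Enum):
    OK = auto()
    TABLET_NOT_FOUND = auto()
    SERVICE_UNAVAILABLE = auto()


class FollowerMaster:
    """Toy stand-in for master C (hypothetical, not Kudu's implementation)."""

    def __init__(self, fixed):
        self.fixed = fixed              # hypothetical switch: reply SERVICE_UNAVAILABLE while starting
        self.systable_ready = False
        self.recognizes_leader = False

    def finish_startup(self):
        self.systable_ready = True

    def update_consensus(self):
        if not self.systable_ready:
            # Step 3: the systable is not initialized yet.
            return Resp.SERVICE_UNAVAILABLE if self.fixed else Resp.TABLET_NOT_FOUND
        self.recognizes_leader = True
        return Resp.OK


class LeaderMaster:
    """Toy stand-in for master A (hypothetical)."""

    def __init__(self):
        self.needs_copy = False

    def tick(self, follower):
        if self.needs_copy:
            # Steps 4-5: tablet copy is not implemented for masters, so in this
            # model the copy never starts and the flag is never cleared.
            return "StartTabletCopy"
        resp = follower.update_consensus()
        if resp is Resp.TABLET_NOT_FOUND:
            self.needs_copy = True
        # Assumption: on SERVICE_UNAVAILABLE the leader just retries next tick.
        return "UpdateConsensus"


def simulate(fixed, ticks=5, ready_at=2):
    """Return (does C recognize the leader?, list of RPCs A sent)."""
    leader, follower = LeaderMaster(), FollowerMaster(fixed)
    sent = []
    for t in range(ticks):
        if t == ready_at:
            follower.finish_startup()
        sent.append(leader.tick(follower))
    return follower.recognizes_leader, sent


if __name__ == "__main__":
    print(simulate(fixed=False))
    print(simulate(fixed=True))
```

With `fixed=False` the model reproduces the stuck state: after the first 
TABLET_NOT_FOUND, A only issues StartTabletCopy and C never sees another 
UpdateConsensus. With `fixed=True` A keeps retrying and C recognizes the 
leader once its startup completes.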

This effectively means the cluster is running with two masters until there is a 
leadership change. This caused the flakiness of 
KsckRemoteTest.TestClusterWithLocation because C never recognizes the 
leadership of A, so Ksck master consensus checks fail.

A regular tablet on a tablet server is not vulnerable to this. It's specific to 
how the master starts up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
