Adar Dembo created KUDU-1358:
--------------------------------

             Summary: Following a master leader election, create table may fail
                 Key: KUDU-1358
                 URL: https://issues.apache.org/jira/browse/KUDU-1358
             Project: Kudu
          Issue Type: Bug
          Components: master
    Affects Versions: 0.7.0
            Reporter: Adar Dembo
            Assignee: Adar Dembo
            Priority: Critical


In the current multi-master design and implementation, tservers only heartbeat 
to the leader master. After a master leader election, there's a short window of 
time in which the new leader master may not be aware of the existence of some 
(or even all) of the tservers. Attempts to create a table during this window 
may fail, as the tservers known to the new leader master may be too few to 
satisfy the new table's replication factor. Whether the window exists at all 
depends on whether the new leader master has been leader before, and whether 
any of the tservers sent heartbeats to it while it was leader.
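
To make the failure window concrete, here is a minimal toy model of it. This is only a sketch: the Master class, its methods, and the tserver names are hypothetical stand-ins for illustration, not Kudu's actual code or API. It assumes the set of known tservers is purely in-memory state built from heartbeats.

```python
# Toy model of the failure window. All names here are hypothetical
# illustrations, not Kudu's real classes or RPCs.

class Master:
    def __init__(self, name):
        self.name = name
        # "Soft state": learned only from heartbeats, never replicated.
        self.known_tservers = set()

    def heartbeat(self, tserver_id):
        """Record that a tserver exists."""
        self.known_tservers.add(tserver_id)

    def create_table(self, table, replication_factor):
        """Pick replicas, or fail if too few tservers are known."""
        if len(self.known_tservers) < replication_factor:
            raise RuntimeError(
                "%s: cannot create %s: %d tservers known, %d needed"
                % (self.name, table, len(self.known_tservers),
                   replication_factor))
        return sorted(self.known_tservers)[:replication_factor]


masters = [Master("m1"), Master("m2"), Master("m3")]
tservers = ["ts1", "ts2", "ts3"]

# tservers heartbeat only to the current leader (m1).
leader = masters[0]
for ts in tservers:
    leader.heartbeat(ts)
replicas_before = leader.create_table("t1", 3)

# Leader election: m2 takes over, but has never received a heartbeat.
leader = masters[1]
try:
    leader.create_table("t2", 3)
    failed_in_window = False
except RuntimeError:
    failed_in_window = True

# Once every tserver has heartbeated to the new leader, the window closes.
for ts in tservers:
    leader.heartbeat(ts)
replicas_after = leader.create_table("t2", 3)
```

In this model the window lasts until enough tservers have heartbeated to the new leader to satisfy the replication factor; had m2 been leader earlier and kept its old soft state, the second create might not have failed at all.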

Some possible solutions include:
# Modifying the heartbeat protocol so that tservers heartbeat to _all_ masters, 
leaders and followers alike. Doing this keeps the "soft state" of every master 
up to date, at the cost of additional network bandwidth spent on heartbeating. 
Changes may also be needed to ensure that a follower master can't cause a 
tserver to take any actions.
# Never actually failing a create table request due to too few tservers, 
instead allowing it to linger until enough tservers are known to the leader 
master. For this to be practical we'd need to allow clients to "cancel" a 
previously issued create table request.
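
The two options above can be sketched together in a small model. Everything here is an assumption made for illustration: the Master class, the cancel_create call, the "CREATED"/"PENDING" statuses, and the command tuples are hypothetical and do not correspond to existing Kudu interfaces.

```python
# Hypothetical sketch of both proposals; none of these names are Kudu APIs.

class Master:
    def __init__(self, name, is_leader=False):
        self.name = name
        self.is_leader = is_leader
        self.known_tservers = set()  # soft state, fed by heartbeats
        self.pending = {}            # table -> replication factor
        self.created = {}            # table -> chosen replicas

    def heartbeat(self, tserver_id):
        """Option 1: every master records the tserver, but only the
        leader may return commands that make tservers act."""
        self.known_tservers.add(tserver_id)
        if not self.is_leader:
            return []
        return self._place_pending()

    def create_table(self, table, replication_factor):
        """Option 2: never fail for too few tservers; let the request
        linger until it can be satisfied."""
        self.pending[table] = replication_factor
        self._place_pending()
        return "CREATED" if table in self.created else "PENDING"

    def cancel_create(self, table):
        """Let a client withdraw a request that is still lingering."""
        return self.pending.pop(table, None) is not None

    def _place_pending(self):
        commands = []
        for table, rf in list(self.pending.items()):
            if len(self.known_tservers) >= rf:
                replicas = sorted(self.known_tservers)[:rf]
                self.created[table] = replicas
                del self.pending[table]
                commands.append(("create_replicas", table, replicas))
        return commands


# A follower hears from two tservers and issues no commands.
m2 = Master("m2")
follower_cmds = [m2.heartbeat(ts) for ts in ("ts1", "ts2")]

# After promotion its soft state is already warm (option 1).
m2.is_leader = True
status_a = m2.create_table("a", 2)

# Requests needing more tservers linger instead of failing (option 2).
status_b = m2.create_table("b", 3)
status_c = m2.create_table("c", 5)
cancelled_c = m2.cancel_create("c")

# A third tserver appears; the lingering request for "b" is placed.
cmds = m2.heartbeat("ts3")
```

The sketch glosses over the ramifications mentioned below, such as how long a lingering request should be kept and whether pending requests must survive another leader election.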

Both approaches probably have additional ramifications; this problem needs 
to be thought through carefully.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)