Adar Dembo created KUDU-1358:
--------------------------------
Summary: Following a master leader election, create table may fail
Key: KUDU-1358
URL: https://issues.apache.org/jira/browse/KUDU-1358
Project: Kudu
Issue Type: Bug
Components: master
Affects Versions: 0.7.0
Reporter: Adar Dembo
Assignee: Adar Dembo
Priority: Critical
In the current multi-master design and implementation, tservers only heartbeat
to the leader master. After a master leader election, there's a short window of
time in which the new leader master may not be aware of the existence of some
(or even all) of the tservers. Attempts to create a table during this window
may fail, as the tservers known to the new leader master may be too few to
satisfy the new table's replication factor. Whether the window exists in the
first place depends on whether the new leader master had been leader before,
and whether any of the tservers had sent heartbeats to it during that time.
Some possible solutions include:
# Modifying the heartbeat protocol so that tservers heartbeat to _all_ masters,
leaders and followers alike. Doing this will ensure that the "soft state"
belonging to any master is always up-to-date at the cost of network bandwidth
lost to heartbeating. Additionally, changes may need to be made to ensure that
a follower master can't cause a tserver to take any actions.
# Never actually failing a create table request due to too few tservers,
instead allowing it to linger until such a time when more tservers exist. For
this to actually be practical we'd need to allow clients to "cancel" a
previously issued create table request.
Both approaches probably include additional ramifications; this problem needs
to be thought through carefully.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)