Adar Dembo has posted comments on this change.

Change subject: design-docs: multi-master for 1.0 release
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/2527/2/docs/design-docs/multi-master-1.0.md
File docs/design-docs/multi-master-1.0.md:

Line 51: 3. Raft configuration changes, via tserver heartbeats. These include both
> clarify: tablet server raft configuration changes
Done


Line 131: other more mundane reasons, raising the likelihood of a master crash.
> It will typically fail because you're not the leader anymore. Not sure of a
Thanks for the clarification.


Line 181:       memory and ignores all requested actions from an older term.
> in-memory isn't really enough for safety. it has to be synced to disk to re
Hmm, David and I talked about that, but we concluded that it was unnecessary. 
Thinking about it more, though, I think you may be right. Here's a contrived 
example:
1. Single tserver is heartbeating to three masters. Its "current master term" 
is 1.
2. Term 1 leader master is partitioned from the other masters.
3. Other masters elect a new leader in term 2.
4. TS heartbeats, notices term 2, resets "current master term".
5. A client creates a new table.
6. Term 2 leader master asks TS to create a tablet for the new table, which it 
does.
7. TS restarts. "Current master term" is now 0 (or -1, or whatever).
8. TS heartbeats to all masters but term 1 leader responds first.
9. Term 1 leader sees the new tablet, has no table for it, tells TS to delete 
it.
10. TS accepts the deletion because term 1 leader's term is compatible with the 
"current master term".
11. Term 1 leader finally notices that it's no longer the leader.
12. TS heartbeat to other masters goes through, "current master term" is now 2.

The restart at step 7 is what breaks this: unless the current master term has 
been persisted, the TS forgets that it saw term 2 and accepts the stale term 1 
leader's deletion in step 10.
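
To make the sequence concrete, here is a minimal sketch that replays the steps 
above against a toy tserver, once with the term kept only in memory and once 
with it synced to disk. It is Python for brevity and is not Kudu code: 
TabletServerSketch, replay_scenario, the JSON term file, and the rule "reject 
any request whose term is lower than the highest term seen" are all 
assumptions made for illustration.

```python
import json
import os
import tempfile


class TabletServerSketch:
    """Hypothetical tserver that remembers the highest master term it has seen."""

    def __init__(self, term_path=None):
        # term_path=None models the in-memory-only variant.
        self.term_path = term_path
        self.tablets = set()
        self.current_master_term = self._load_term()

    def _load_term(self):
        if self.term_path and os.path.exists(self.term_path):
            with open(self.term_path) as f:
                return json.load(f)["term"]
        return 0  # nothing persisted: the term resets, as in step 7

    def _store_term(self, term):
        self.current_master_term = term
        if self.term_path:
            # Write to a temp file, fsync, then rename over the old file.
            tmp = self.term_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"term": term}, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, self.term_path)

    def observe_term(self, term):
        if term > self.current_master_term:
            self._store_term(term)

    def handle_request(self, term, action, tablet_id):
        # Ignore requested actions from an older term.
        if term < self.current_master_term:
            return False
        self.observe_term(term)
        if action == "create":
            self.tablets.add(tablet_id)
        elif action == "delete":
            self.tablets.discard(tablet_id)
        return True

    def restart(self):
        restarted = TabletServerSketch(self.term_path)
        restarted.tablets = set(self.tablets)  # tablet data survives a restart
        return restarted


def replay_scenario(term_path):
    ts = TabletServerSketch(term_path)
    ts.observe_term(1)                          # step 1
    ts.observe_term(2)                          # step 4
    ts.handle_request(2, "create", "tablet-a")  # step 6
    ts = ts.restart()                           # step 7
    accepted = ts.handle_request(1, "delete", "tablet-a")  # steps 8-10
    ts.observe_term(2)                          # step 12
    return accepted, ts.tablets


def demo():
    in_memory = replay_scenario(None)
    with tempfile.TemporaryDirectory() as d:
        persisted = replay_scenario(os.path.join(d, "master_term.json"))
    return in_memory, persisted
```

With the in-memory variant the stale deletion is accepted and the tablet is 
gone by step 12. With the persisted variant the restarted TS reloads term 2, 
rejects the term 1 leader's request, and keeps the tablet.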

Note that a brand new TS is still vulnerable to taking actions dictated by a 
rogue master. Most destructive actions aren't applicable because the TS has no 
tablets (e.g. it can't delete a tablet), but could there be other problematic 
actions?


Line 185: 
> 3. Would RAFT leader leases solve this problem?
These no-op replications are almost an implicit leader lease implementation. 
Read v3 of this draft; I talk about that in more detail.


Line 188: To help fix KUDU-495, the master should finish partially replicated
> Could you be more specific here? I don't fully understand what's being roll
Based on David's feedback I looked into whether we can avoid explicitly rolling 
forward persistent state altogether, and I think we can. See v3 of this draft 
for details.


-- 
To view, visit http://gerrit.cloudera.org:8080/2527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
