Adar Dembo has posted comments on this change.

Change subject: design-docs: multi-master for 1.0 release
......................................................................

Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/2527/2/docs/design-docs/multi-master-1.0.md
File docs/design-docs/multi-master-1.0.md:

Line 51: 3. Raft configuration changes, via tserver heartbeats. These include both
> clarify: tablet server raft configuration changes

Done


Line 131: other more mundane reasons, raising the likelihood of a master crash.
> It will typically fail because you're not the leader anymore. Not sure of a

Thanks for the clarification.


Line 181: memory and ignores all requested actions from an older term.
> in-memory isn't really enough for safety. it has to be synced to disk to re

Hmm, David and I talked about that, but we concluded that it was unnecessary. Thinking about it more, though, I think you may be right. Here's a contrived example:

1. Single tserver is heartbeating to three masters. Its "current master term" is 1.
2. Term 1 leader master is partitioned from the other masters.
3. Other masters elect a new leader in term 2.
4. TS heartbeats, notices term 2, resets "current master term".
5. A client creates a new table.
6. Term 2 leader master asks TS to create a tablet for the new table, which it does.
7. TS restarts. "Current master term" is now 0 (or -1, or whatever).
8. TS heartbeats to all masters but term 1 leader responds first.
9. Term 1 leader sees the new tablet, has no table for it, tells TS to delete it.
10. TS accepts the deletion because term 1 leader's term is compatible with the "current master term".
11. Term 1 leader finally notices that it's no longer the leader.
12. TS heartbeat to other masters goes through, "current master term" is now 2.

The restart at step 7 gums up the works unless the current master term has been persisted.

Note that a brand new TS is still vulnerable to taking actions dictated by a rogue master. Most destructive actions aren't applicable because the TS has no tablets (i.e. it can't delete a tablet), but could there be other problematic actions?


Line 185:
> 3.
> Would RAFT leader leases solve this problem? These no-op replications are almost an implicit leader lease implementation.

Read v3 of this draft; I talk about that in more detail.


Line 188: To help fix KUDU-495, the master should finish partially replicated
> Could you be more specific here? I don't fully understand what's being roll

Based on David's feedback I looked into whether we can avoid explicitly rolling forward persistent state altogether, and I think we can. See v3 of this draft for details.


--
To view, visit http://gerrit.cloudera.org:8080/2527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
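
To make the contrived example in the Line 181 reply concrete, here is a rough Python sketch that replays its twelve steps against a toy tablet server. Everything in it is hypothetical and invented for illustration (`TabletServer`, `observe_term`, `handle`, `restart`, `run_scenario`, and the `durable_term` field standing in for a value synced to disk); none of it is Kudu code or an API from the design doc.

```python
from dataclasses import dataclass, field


@dataclass
class TabletServer:
    """Toy tablet server tracking the highest master term it has seen.

    Hypothetical model for illustration only; not Kudu's implementation.
    """
    persist_term: bool
    term: int = 0            # in-memory "current master term"
    durable_term: int = 0    # stand-in for a term that was synced to disk
    tablets: set = field(default_factory=set)

    def observe_term(self, master_term: int) -> None:
        """On a heartbeat response, adopt any newer master term."""
        if master_term > self.term:
            self.term = master_term
            if self.persist_term:
                self.durable_term = master_term

    def handle(self, master_term: int, action: str, tablet: str) -> bool:
        """Apply a master's request unless it comes from an older term."""
        if master_term < self.term:
            return False
        self.observe_term(master_term)
        if action == "create":
            self.tablets.add(tablet)
        elif action == "delete":
            self.tablets.discard(tablet)
        return True

    def restart(self) -> None:
        """Tablets survive a restart; the term survives only if persisted."""
        self.term = self.durable_term if self.persist_term else 0


def run_scenario(persist_term: bool) -> set:
    """Replay the steps from the Line 181 reply; return surviving tablets."""
    ts = TabletServer(persist_term=persist_term)
    ts.observe_term(1)             # step 1: current master term is 1
    ts.observe_term(2)             # steps 2-4: new leader elected in term 2
    ts.handle(2, "create", "t1")   # steps 5-6: term 2 leader creates a tablet
    ts.restart()                   # step 7: TS restarts
    ts.handle(1, "delete", "t1")   # steps 8-10: stale term 1 leader asks for deletion
    ts.observe_term(2)             # step 12: heartbeat to the other masters
    return ts.tablets
```

Under these assumptions, `run_scenario(False)` ends with the tablet deleted, because the restart resets the term to 0 and the term 1 request looks valid. `run_scenario(True)` restores term 2 on restart, rejects the stale delete, and keeps the tablet.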
