Mike Percy has posted comments on this change. Change subject: design-docs: multi-master for 1.0 release ......................................................................
Patch Set 4: (8 comments) http://gerrit.cloudera.org:8080/#/c/2527/4/docs/design-docs/multi-master-1.0.md File docs/design-docs/multi-master-1.0.md: Line 82: 3. Global: map of table name to table instance. Deleted tables are omitted, why not just table name to table id? Line 84: 4. Global: map of tablet ID to tablet instance. All tablets are present, deleted tables are transient, right? Line 213: up-to-date before this change is made. We would also likely want a way to express how stale of a follower master we are willing to accept, possibly based on hybrid time. Line 215: It would be nice to implement this for Kudu 1.0, but it's not a strict Yeah I don't think this feature would be driven by a pressing need. Line 222: TODO: JD says the code that detects the current master was partially removed from the Java client, I'm assuming. Line 278: allow for replicating no-ops (i.e. log entries that needn't be persisted). Actually, we do have this in Raft; heartbeats serve this function. Some options along these lines to assert leadership: 1. Add a local-only CheckLeadership() call to the Consensus class that returns true if: (a) we believe we are the leader and (b) we have heartbeated to a majority in the last election timeout. 2. alternatively, CheckLeadership() could be more conservative: trigger an immediate heartbeat to a majority if we are, say, half way through our election timeout, then block waiting for a majority response. This would give us more time to act within our election timeout "lease". This whole approach to leases is a bit imprecise and due to OS hiccups and network delays does not guarantee that a rogue master wouldn't be able to command a tserver... it just significantly reduces the window in which that could occur. However it may be safe, given the way we have designed the administrative RPC APIs from master -> tserver. I think the requirement for such RPC APIs is that they need to tolerate, and reject, command from a master that has stale information relative to the Raft config of the tablet being commanded. For example, if a tserver gets a ChangeConfig() RPC from a master that specifies an old config, the leader tserver will reject the command from the master, because the master has fallen out of date. As long as we maintain this invariant across all master -> tserver RPCs, I think the master leader lease approach should be sufficient for safety. What I haven't done is go through all the RPCs and ensure that this holds, and we should probably do that. If this logic is the basis for the safety design of master failover, we should also ensure that invariant makes it into this doc. Line 417: 1. Adding a new global in-memory unordered_set to "lock" a table's name what does this "lock" actually do? Prevent visibility? We want visibility on dashboards if possible, for administrators. Line 462: 1. Remove step #5 You're saying that on each iteration, we try to select the replicas again and only fail on timeout like in step 2? If so, this seems OK to me. -- To view, visit http://gerrit.cloudera.org:8080/2527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
