Adar Dembo has posted comments on this change.

Change subject: design-docs: multi-master for 1.0 release
......................................................................
Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2527/4/docs/design-docs/multi-master-1.0.md
File docs/design-docs/multi-master-1.0.md:

Line 275: master replicates via Raft before triggering a state change. It doesn't
> I think something like this is as good as persistence, since we're piggy-ba

We discussed this further on Slack and agreed that this sort of checking could be a replacement for an explicit lease check as described below.

Line 278: allow for replicating no-ops (i.e. log entries that needn't be persisted).
> With the leasing algorithm, I think what needs to be done is wait for an el

We talked about this at length on Slack today. The discussion centered on three questions:

1. What decisions can a master make unilaterally (i.e. without first replicating so as to establish consensus)?
2. When making such a decision, does the master at least consult past replicated state first? If it does, would "stale" state yield incorrect decisions?
3. If it doesn't (or can't) consult replicated state, are the external actions performed as a result of the decision safe?

We identified the set of potentially problematic external actions as those taken by the master while processing tablet reports. We ruled out ChangeConfig: it is safe due to the use of CAS on the last change config opid (which protects against two leader masters both trying to add a server), and because even if the master somehow added a redundant server, in the worst case the new replica would be deleted the next time it heartbeats.

That left DeleteReplica, which is called under the following circumstances (see the sketch below):

a. When the master can't find any record of the replica's tablet or its table, the replica is deleted.
b. When the persistent state says that the replica (or its table) has been deleted, the replica is deleted.
c. When the persistent state says that the replica is no longer part of the Raft config, the replica is deleted.
d. When the persistent state includes replicas that aren't in the latest Raft config, those replicas are deleted.

Like ChangeConfig, cases c and d are protected with a CAS. Cases a and b are not, but b falls into category #2 above: if persistent state is consulted and the decision is made to delete a replica, that decision is correct and cannot become incorrect (i.e. under no circumstance would a tablet become "undeleted").

That leaves case a as the most problematic. We could use leader leases as described earlier, or "current term" checking, to protect against it. Alternatively, we could 1) continue our current policy of retaining the persistent state of deleted tables/tablets forever, and 2) change the master not to delete tablets for which it has no records. If we always have the persistent state for deleted tables, case a should always fold into case b unless there's some drastic problem (e.g. tservers heartbeating to the wrong master), in which case not deleting the tablets is probably the right thing to do.

After further discussion, we resolved to take this "alternate" approach because it's the cheapest to implement. This means that, for the time being, we won't implement leader leasing or current term checking. I'll update the design doc accordingly.
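To make the case analysis concrete, here's a minimal sketch of the per-replica decision under the alternate approach. Everything in it (CatalogState, TabletRecord, HandleReportedReplica, the Decision enum) is hypothetical scaffolding for illustration, not Kudu's actual master code, and the CAS protection for cases c/d is only hinted at in a comment:

    // Hypothetical sketch of the tablet-report decision logic described above.
    // All types and names are illustrative; they don't correspond to Kudu code.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    enum class Decision { kNone, kDeleteReplica, kIgnore };

    // Simplified replicated (persistent) state for one tablet.
    struct TabletRecord {
      bool deleted;                        // case b: tablet/table was deleted
      std::set<std::string> raft_members;  // latest known Raft config
      int64_t change_config_opid;          // stand-in for the CAS token (cases c/d)
    };

    using CatalogState = std::map<std::string, TabletRecord>;  // keyed by tablet id

    // Decide what to do with a replica of 'tablet_id' reported by tserver 'ts_uuid'.
    Decision HandleReportedReplica(const CatalogState& state,
                                   const std::string& tablet_id,
                                   const std::string& ts_uuid) {
      auto it = state.find(tablet_id);
      if (it == state.end()) {
        // Case a: no record of the tablet at all. With deleted tablets retained
        // forever, this indicates something drastic (e.g. heartbeats to the wrong
        // master), so the safe choice is to NOT delete the replica.
        return Decision::kIgnore;
      }
      const TabletRecord& rec = it->second;
      if (rec.deleted) {
        // Case b: replicated state says the tablet/table was deleted. Acting on
        // stale state is still safe because a deletion is never undone.
        return Decision::kDeleteReplica;
      }
      if (rec.raft_members.count(ts_uuid) == 0) {
        // Cases c/d: the reporting server isn't in the latest Raft config. The
        // real DeleteReplica is CAS-protected; rec.change_config_opid stands in
        // for that token here.
        return Decision::kDeleteReplica;
      }
      return Decision::kNone;
    }

    int main() {
      CatalogState state;
      state["t1"] = TabletRecord{false, {"ts-a", "ts-b", "ts-c"}, 7};
      state["t2"] = TabletRecord{true, {}, 3};

      std::cout << static_cast<int>(HandleReportedReplica(state, "t1", "ts-a"))  // in config
                << static_cast<int>(HandleReportedReplica(state, "t2", "ts-a"))  // case b
                << static_cast<int>(HandleReportedReplica(state, "t1", "ts-z"))  // case c/d
                << static_cast<int>(HandleReportedReplica(state, "t9", "ts-a"))  // case a
                << std::endl;
      return 0;
    }

The key point the sketch captures is that only case a changes under the alternate approach; cases b through d keep their existing (replicated-state or CAS-backed) justification.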
Line 317: A batched logical operation includes one table entry and *N* tablet entries,
: where *N* is the number of tablets in the table. When *N* is very large, it
: is possible for the batch size to exceed the maximum size of a Kudu
: RPC. This effect was measured in the creation of a table with 1000 tablets
: and a simple three-column schema: the Write RPC clocked in at ~117 KB, a far
: cry from the 8 MB maximum RPC size. Thus, a batch-based approach should not
: be unreasonable for today's scale targets.
> I'm having trouble following this. First it says when N is very large it c

I agree this isn't written as clearly as it could have been. What I'm trying to say is this: the main concern with batching an entire CreateTable() request into one logical operation is whether, for a table with many tablets, the replicated Write() to the follower masters exceeds the maximum RPC size. I wrote a test to measure this with 1000 tablets and found the resulting Write() to be an order of magnitude below the maximum size, so I conclude that this approach should be safe for the time being.

To your second point, what matters is the size of the WriteRequestPB replicated by the leader master to each of the follower masters, not the CreateTableRequestPB sent by the client to the leader master. The WriteRequestPB includes one inserted row per created tablet, most of which is consumed by a SysTabletsEntryPB representing the state of the new tablet.

I'll rewrite this paragraph to be clearer.
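For a sense of the headroom, here's a back-of-the-envelope sketch derived from that measurement. The assumption that the request grows roughly linearly with the number of tablets (i.e. a roughly constant per-tablet row size) is mine, not something the test verified:

    // Rough headroom estimate for the replicated Write(), assuming the request
    // grows roughly linearly with the number of tablets. The 117 KB / 1000-tablet
    // data point comes from the measurement described above.
    #include <cstdio>

    int main() {
      const double measured_bytes = 117.0 * 1024;      // ~117 KB for 1000 tablets
      const double measured_tablets = 1000;
      const double max_rpc_bytes = 8.0 * 1024 * 1024;  // 8 MB maximum RPC size

      const double bytes_per_tablet = measured_bytes / measured_tablets;  // ~120 B
      const double tablets_at_limit = max_rpc_bytes / bytes_per_tablet;   // ~70,000

      std::printf("~%.0f bytes per tablet entry\n", bytes_per_tablet);
      std::printf("~%.0f tablets before exceeding the max RPC size\n",
                  tablets_at_limit);
      return 0;
    }

Under those assumptions, even a schema that inflated the per-tablet SysTabletsEntryPB severalfold would still leave a comfortable margin at today's scale targets.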
--
To view, visit http://gerrit.cloudera.org:8080/2527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes