Adar Dembo has posted comments on this change.

Change subject: design-docs: multi-master for 1.0 release
......................................................................


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2527/4/docs/design-docs/multi-master-1.0.md
File docs/design-docs/multi-master-1.0.md:

Line 275: master replicates via Raft before triggering a state change. It 
doesn't
> I think something like this is as good as persistence, since we're piggy-ba
We discussed this further on Slack and agreed that this sort of checking could 
be a replacement for an explicit lease check as described below.


Line 278: allow for replicating no-ops (i.e. log entries that needn't be 
persisted).
> With the leasing algorithm, I think what needs to be done is wait for an el
We talked about this at length on Slack today. The discussion centered around:
1. What decisions can a master make unilaterally (i.e. without first 
replicating so as to establish consensus)?
2. When making such decisions, does the master at least consult past replicated 
state first? If it does, would "stale" state yield incorrect decisions?
3. If it doesn't (or can't) consult replicated state, are the external actions 
performed as a result of the decision safe?

We identified the set of potentially problematic external actions as those 
taken by the master during tablet reports.

We ruled out ChangeConfig; it is safe due to the CAS on the last change 
config opid (which protects against two leader masters both trying to add a 
server), and because even if the master somehow added a redundant server, in 
the worst case the new replica will be deleted the next time it heartbeats.
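To illustrate why the CAS makes ChangeConfig safe, here's a minimal sketch of a compare-and-swap guard on the last change-config opid. All names here are hypothetical, for illustration only; this is a model of the idea, not Kudu's actual API:

```python
# Hypothetical sketch of a CAS guard on config changes (illustrative
# names, not Kudu's actual code).
class TabletState:
    def __init__(self, last_change_config_opid):
        self.last_change_config_opid = last_change_config_opid

def try_change_config(tablet, expected_opid, new_opid, apply_fn):
    """Apply a config change only if the caller's view of the last
    change-config opid matches the current one (compare-and-swap).

    If two leader masters race to add a server, at most one of them
    holds the current opid, so at most one CAS succeeds."""
    if tablet.last_change_config_opid != expected_opid:
        return False  # stale view; reject the change
    apply_fn()
    tablet.last_change_config_opid = new_opid
    return True
```

A second master retrying with a stale `expected_opid` simply gets `False` back and must re-read the config before trying again.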

That left DeleteReplica, which is called under the following circumstances:
a. When the master can't find any record of the replica's tablet or its table, 
the replica is deleted.
b. When the persistent state says that the replica (or its table) has been 
deleted, it is deleted.
c. When the persistent state says that the replica is no longer part of the 
Raft config, it is deleted.
d. When the persistent state includes replicas that aren't in the latest Raft 
config, they are deleted.

Like ChangeConfig, cases c and d are protected with a CAS. Cases a and b are 
not, but b falls into category #2 above: even if the persistent state 
consulted is stale, a decision to delete a replica is correct and cannot later 
become incorrect (i.e. under no circumstance would a tablet become 
"undeleted").

That leaves case a as the most problematic. We could protect against it with 
leader leases as described earlier or with "current term" checking. 
Alternatively, we
could 1) continue our current policy of retaining persistent state of deleted 
tables/tablets forever, and 2) change the master not to delete tablets for 
which it has no records. If we always have the persistent state for deleted 
tables, case a should always fold into case b unless there's some drastic 
problem (e.g. tservers are heartbeating to the wrong master), in which case not 
deleting the tablets is probably the right thing to do.
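Under that policy, the decision logic above might be modeled roughly like the following sketch. All names are hypothetical and illustrative; this is not Kudu code, and case d (the symmetric check iterating over the master's own replica records) is omitted for brevity:

```python
# Hypothetical model of the DeleteReplica decisions (cases a-c above)
# under the "alternate" policy: unknown tablets (case a) are NOT
# deleted, because persistent state for deleted tables/tablets is kept
# forever, so a genuinely deleted replica always matches case b instead.
from collections import namedtuple

Entry = namedtuple("Entry", ["deleted", "raft_config"])    # persistent state
Report = namedtuple("Report", ["tablet_id", "peer_uuid"])  # tablet report

def should_delete_replica(catalog, report):
    entry = catalog.get(report.tablet_id)
    if entry is None:
        # Case a: no record at all. With tombstones retained forever,
        # this suggests something drastic (e.g. a tserver heartbeating
        # to the wrong master), so the safe choice is to keep the replica.
        return False
    if entry.deleted:
        # Case b: persistent state says the tablet/table was deleted.
        # Even stale state cannot make this wrong: deletion is permanent.
        return True
    if report.peer_uuid not in entry.raft_config:
        # Case c: replica is no longer part of the Raft config
        # (CAS-protected in the real system).
        return True
    return False
```

The key property is that every `True` branch consults replicated state first, so no unilateral decision can issue an unsafe delete.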

After further discussion, we resolved to take the "alternate" approach because 
it's cheapest to implement. This means that, for the time being, we won't 
implement leader leasing or current term checking.

I'll update the design doc accordingly.


Line 317: A batched logical operation includes one table entry and *N* tablet 
entries,
        : where *N* is the number of tablets in the table. When *N* is very 
large, it
        : is possible for the batch size to exceed the maximum size of a Kudu
        : RPC. This effect was measured in the creation of a table with 1000 
tablets
        : and a simple three-column schema: the Write RPC clocked in at ~117 
KB, a far
        : cry from the 8 MB maximum RPC size. Thus, a batch-based approach 
should not
        : be unreasonable for today's scale targets.
> I'm having trouble following this.  First it says when N is very large it c
I agree this isn't as clearly written as it could have been.

What I'm trying to say is this: the main concern around batching together an 
entire CreateTable() request is whether, for a table with many tablets, the 
replicated Write() to follower masters exceeds the maximum RPC size. But, I 
wrote a test to measure this with 1000 tablets, and found the resulting Write() 
to be well over an order of magnitude below the maximum size. So I conclude 
that this
approach should be safe for the time being.

To your second point, we're concerned with the size of the WriteRequestPB 
replicated by the leader master to each of the follower masters, not the 
CreateTableRequestPB sent by the client to the leader master. The 
WriteRequestPB includes one inserted row per created tablet, most of which is 
consumed by a SysTabletsEntryPB representing the state of the new tablet.
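To make the margin concrete, here's a back-of-the-envelope calculation from the measured numbers above. The per-tablet figure and the scaling limit are extrapolations, assuming the replicated Write() grows roughly linearly with tablet count:

```python
# Back-of-the-envelope check using the figures quoted above: a table
# with 1000 tablets produced a replicated Write() of ~117 KB, against
# an 8 MB maximum RPC size.
measured_bytes = 117 * 1024        # ~117 KB Write for 1000 tablets
tablets = 1000
max_rpc_bytes = 8 * 1024 * 1024    # 8 MB maximum RPC size

per_tablet = measured_bytes / tablets      # ~120 bytes per inserted row
headroom = max_rpc_bytes / measured_bytes  # ~70x margin at 1000 tablets
tablets_at_limit = int(max_rpc_bytes / per_tablet)  # roughly 70,000 tablets

print(f"{per_tablet:.0f} B/tablet, {headroom:.0f}x headroom, "
      f"limit ~{tablets_at_limit} tablets")
```

In other words, with this simple schema the batch would only approach the RPC limit at tens of thousands of tablets per table, well beyond today's scale targets.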

I'll rewrite this paragraph to be clearer.


-- 
To view, visit http://gerrit.cloudera.org:8080/2527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
