Adar Dembo has posted comments on this change.

Change subject: design-docs: multi-master for 1.0 release
......................................................................
Patch Set 4:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/2527/4/docs/design-docs/multi-master-1.0.md
File docs/design-docs/multi-master-1.0.md:

Line 74: 4. Replica instance. Includes a link to the tserver, the state of the replica
> it's true there is a TabletReplica map, but this is a struct that is never
OK, will do.


Line 82: 3. Global: map of table name to table instance. Deleted tables are omitted,
> why not just table name to table id?
I'm just describing the status quo here. We could refactor this map to be name-to-id, but that's orthogonal to adding multi-master support.


Line 84: 4. Global: map of tablet ID to tablet instance. All tablets are present,
> deleted tables are transient, right?
What do you mean? Also, this comment is on L84, which describes the tablet-ID-to-tablet-instance map, so did you mean to ask about tables, or tablets?


Line 213: up-to-date before this change is made.
> We would also likely want a way to express how stale of a follower master w
Perhaps. It's acceptable for state that is updated via heartbeats or RPC responses to "lag", since TS messages can arrive at the master whenever. For operations reading such state, follower masters are no worse than the leader. But state that is canonical on the leader (e.g. the existence of a table) is different, and for that you're right that we need some way for the client to bound its expectations. I'm not going to bother writing much about this since we've agreed it's out of scope for 1.0.


Line 222: TODO: JD says the code that detects the current master was partially removed
> from the Java client, I'm assuming.
Yes, I'll clarify that.


Line 246: #### Follower masters must treat heartbeats as no-ops
> seems contradictory - you aren't treating it quite as a no-op -- you're tre
Yeah, I'll clarify.


Line 275: master replicates via Raft before triggering a state change.
It doesn't
> I think there's another alternative: on startup, the tserver sends a heartb
To clarify, does the TS use the term that is agreed upon by the majority of masters? Or the highest term amongst all of the masters, allowing enough time for each heartbeat to complete (or time out)? In any event, I'd strongly prefer a solution that avoids any new persistence. Mike, what do you think of Todd's idea?


Line 278: allow for replicating no-ops (i.e. log entries that needn't be persisted).
> Actually, we do have this in Raft; heartbeats serve this function.
Your point about the existing invariant in administrative RPCs is well taken, and I agree that, largely speaking, it is upheld across the board. It is upheld by the CAS index when deleting replicas or changing configs, by the schema version when altering tablets, and by the tablet ID itself when creating new replicas.

I'm actually only worried about the first three calls to SendDeleteTabletRequest() in CatalogManager::HandleReportedTablet(), which do not provide the CAS index guaranteeing the invariant. Today the master retains persistent state for deleted tables for eternity, and if that remains true, the second call should be virtually unreachable, as there's no way for there to be persistent state for the tablet but not its table. The first call is still troublesome, as the rogue master in the example doesn't know about the heartbeating tablet at all. We could remove the call to SendDeleteTabletRequest() and just log the warning; maybe that's the right thing to do. The third call should have enough state to provide the CAS index, but it currently doesn't. That might just be a factor of how the code is structured, or because the CAS index isn't guaranteed to exist (maybe the tablet hasn't elected a leader yet), or because it's not necessary; it isn't a problem for the rogue master example.

To summarize, perhaps modifying the semantics of the first call to SendDeleteTabletRequest() is enough?
Maybe all of this hand-wringing about fencing is exaggerated?


Line 327: To fix KUDU-1353, the per-tablet replica locations should be rebuilt on
> Maybe we should do away with this information and just use the RaftConfigPB
The RaftConfigPB is insufficient if we're going to remove the RaftPeerPB::last_known_addr, unless you mean for every affected operation (GetTabletLocations, GetTableLocations, DeleteTablet, and PickLeaderReplica::PickReplica) to use the UUID in the RaftConfigPB to look up the TSDescriptor on the spot. That's doable, but it makes those operations more expensive (N*M lookups, where N is the number of tablets and M is the replication factor).


Line 417: 1. Adding a new global in-memory unordered_set to "lock" a table's name
> yea, I tend to agree that for many operations, an in-progress state is usef
I disagree; we should limit the visibility of any in-progress state if the operation may still fail. For CreateTable(), if the leader can't replicate the table creation to its followers, the operation is aborted and all state is rolled back. Other clients should _not_ see that in-progress state, or if they do, it should be very clear that the operation may fail (it's not today). That's what this lock is meant to do.


Line 420: LockManager for this, as it has no real tablet dependencies
> per some conversation we had last week, I would rather avoid adding more lo
Generally agreed. I've already posted a diff that I think takes a defensible middle ground: no new locks, and in-progress-but-failable operations are exposed in a different way, so that if a client does see it, they'll know not to rely on it.


Line 423: #### Background task scanning
> should we evaluate getting rid of the "background tasks" thread and instead
We can certainly do that, but I think that's more of a refactoring exercise; it doesn't fix any bugs or add new features per se.


Line 462: 1.
Remove step #5
> You're saying that on each iteration, we try to select the replicas again a
Yes, I think so. Basically, the song and dance in steps 5.1 and 5.2 is necessary because we exposed t_new in steps 2.2 and 2.3. If t_new wasn't exposed until after replica selection, we wouldn't need steps 5.1 or 5.2 at all.


--
To view, visit http://gerrit.cloudera.org:8080/2527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
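[Editor's note on the Line 417 exchange above: the "lock" under discussion is just an in-memory reservation of a table's name for the lifetime of an in-flight CreateTable(), so that two concurrent creations of the same name conflict without the uncommitted name ever being visible to readers. A minimal sketch of that idea follows; the class and method names are illustrative, not Kudu's actual code.]

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical sketch of the in-memory "name lock" for CreateTable():
// a global set of table names whose creation has begun but has not yet
// been replicated/committed. Only other CreateTable() calls consult this
// set, so clients reading the catalog never observe an uncommitted name.
class TableNameReserver {
 public:
  // Returns true if 'name' was reserved; false if another in-flight
  // CreateTable() already holds it.
  bool Reserve(const std::string& name) {
    std::lock_guard<std::mutex> l(lock_);
    return in_flight_.insert(name).second;
  }

  // Releases the reservation, whether the creation committed or aborted.
  void Release(const std::string& name) {
    std::lock_guard<std::mutex> l(lock_);
    in_flight_.erase(name);
  }

 private:
  std::mutex lock_;
  std::unordered_set<std::string> in_flight_;
};
```

In this sketch, CreateTable() would call Reserve() before replicating the new table record and Release() on commit or abort; because nothing else reads the set, the aborted-creation state the reply worries about is never exposed.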
