Adar Dembo has posted comments on this change.

Change subject: design-docs: multi-master for 1.0 release
......................................................................
Patch Set 4:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/2527/4/docs/design-docs/multi-master-1.0.md
File docs/design-docs/multi-master-1.0.md:

Line 74: 4. Replica instance. Includes a link to the tserver, the state of the replica
> it's true there is a TabletReplica map, but this is a struct that is never
OK, will do.


Line 82: 3. Global: map of table name to table instance. Deleted tables are omitted,
> why not just table name to table id?
I'm just describing the status quo here. We could refactor this map to be name-to-id, but that's orthogonal to adding multi-master support.


Line 84: 4. Global: map of tablet ID to tablet instance. All tablets are present,
> deleted tables are transient, right?
What do you mean? Also, this comment is on L84, which describes the tablet-ID-to-tablet-instance map, so did you mean to ask about tables, or tablets?


Line 213: up-to-date before this change is made.
> We would also likely want a way to express how stale of a follower master w
Perhaps. It's acceptable for state that is updated via heartbeats or RPC responses to "lag", since TS messages can arrive at the master whenever. For operations reading such state, follower masters are no worse than the leader. But state that is canonical on the leader (e.g. the existence of a table) is different, and for that you're right that we need some way for the client to bound its expectations. I'm not going to bother writing much about this since we've agreed it's out of scope for 1.0.


Line 222: TODO: JD says the code that detects the current master was partially removed
> from the Java client, I'm assuming.
Yes, I'll clarify that.


Line 246: #### Follower masters must treat heartbeats as no-ops
> seems contradictory - you aren't treating it quite as a no-op -- you're tre
Yeah, I'll clarify.


Line 275: master replicates via Raft before triggering a state change.
It doesn't
> I think there's another alternative: on startup, the tserver sends a heartb
To clarify, does the TS use the term that is agreed upon by the majority of masters? Or the highest term amongst all of the masters, allowing enough time for each heartbeat to complete (or time out)? In any event, I'd strongly prefer a solution that avoids any new persistence. Mike, what do you think of Todd's idea?


Line 278: allow for replicating no-ops (i.e. log entries that needn't be persisted).
> Actually, we do have this in Raft; heartbeats serve this function.
Your point about the existing invariant in administrative RPCs is well taken, and I agree that, largely speaking, it is upheld across the board. It is upheld by the CAS index when deleting replicas or changing configs, by the schema version when altering tablets, and by the tablet ID itself when creating new replicas.

I'm actually only worried about the first three calls to SendDeleteTabletRequest() in CatalogManager::HandleReportedTablet(), which do not provide the CAS index guaranteeing the invariant. Today the master retains persistent state for deleted tables for eternity, and if that remains true, the second call should be virtually unreachable, as there's no way for there to be persistent state for the tablet but not its table. The first call is still troublesome, as the rogue master in the example doesn't know about the heartbeating tablet at all. We could remove the call to SendDeleteTabletRequest() and just log the warning; maybe that's the right thing to do. The third call should have enough state to provide the CAS index, but it currently doesn't. That might just be a factor of how the code is structured, or because the CAS index isn't guaranteed to exist (maybe the tablet hasn't elected a leader yet), or because it's not necessary; it isn't a problem for the rogue master example.

To summarize, perhaps modifying the semantics of the first call to SendDeleteTabletRequest() is enough?
Maybe all of this hand-wringing about fencing is exaggerated?


Line 327: To fix KUDU-1353, the per-tablet replica locations should be rebuilt on
> Maybe we should do away with this information and just use the RaftConfigPB
The RaftConfigPB is insufficient if we're going to remove the RaftPeerPB::last_known_addr, unless you mean for every affected operation (GetTabletLocations, GetTableLocations, DeleteTablet, and PickLeaderReplica::PickReplica) to use the UUID in the RaftConfigPB to look up the TSDescriptor on the spot. That's doable, but it makes those operations more expensive (N*M lookups, where N is the number of tablets and M is the replication factor).


Line 417: 1. Adding a new global in-memory unordered_set to "lock" a table's name
> yea, I tend to agree that for many operations, an in-progress state is usef
I disagree; we should limit the visibility of any in-progress state if the operation may still fail. For CreateTable(), if the leader can't replicate the table creation to its followers, the operation is aborted and all state is rolled back. Other clients should _not_ see that in-progress state, or if they do, it should be very clear that the operation may fail (it's not today). That's what this lock is meant to do.


Line 420: LockManager for this, as it has no real tablet dependencies
> per some conversation we had last week, I would rather avoid adding more lo
Generally agreed. I've already posted a diff that I think takes a defensible middle ground: no new locks, and in-progress-but-failable operations are exposed in a different way, so that if a client does see it, they'll know not to rely on it.


Line 423: #### Background task scanning
> should we evaluate getting rid of the "background tasks" thread and instead
We can certainly do that, but I think that's more of a refactoring exercise; it doesn't fix any bugs or add new features per se.


Line 462: 1.
Remove step #5
> You're saying that on each iteration, we try to select the replicas again a
Yes, I think so. Basically, the song and dance in steps 5.1 and 5.2 is necessary because we exposed t_new in steps 2.2 and 2.3. If t_new wasn't exposed until after replica selection, we wouldn't need steps 5.1 or 5.2 at all.


--
To view, visit http://gerrit.cloudera.org:8080/2527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iad76012977a45370b72a04d608371cecf90442ef
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
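[Editor's note on the Line 417 exchange above: the "lock" under discussion is just an in-memory reservation of a table's name for the lifetime of an in-flight CreateTable(), so that two concurrent creations of the same name conflict without the uncommitted name ever being visible to readers. A minimal sketch of that idea follows; the class and method names are illustrative, not Kudu's actual code.]

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical sketch of the in-memory "name lock" for CreateTable():
// a global set of table names whose creation has begun but has not yet
// been replicated/committed. Only other CreateTable() calls consult this
// set, so clients reading the catalog never observe an uncommitted name.
class TableNameReserver {
 public:
  // Returns true if 'name' was reserved; false if another in-flight
  // CreateTable() already holds it.
  bool Reserve(const std::string& name) {
    std::lock_guard<std::mutex> l(lock_);
    return in_flight_.insert(name).second;
  }

  // Releases the reservation, whether the creation committed or aborted.
  void Release(const std::string& name) {
    std::lock_guard<std::mutex> l(lock_);
    in_flight_.erase(name);
  }

 private:
  std::mutex lock_;
  std::unordered_set<std::string> in_flight_;
};
```

In this sketch, CreateTable() would call Reserve() before replicating the new table record and Release() on commit or abort; because nothing else reads the set, the aborted-creation state the reply worries about is never exposed.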
