Will Berkeley has posted comments on this change. Change subject: [WIP] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config ......................................................................
Patch Set 1: (8 comments) I implemented a GetAllConsensusStates RPC and used it instead of one GetConsensusState RPC per tablet replica; however, this makes the new features available only for servers >= 1.4. Should we stick both forms of logic in so it works on older servers? http://gerrit.cloudera.org:8080/#/c/6772/1/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: PS1, Line 210: s.ToString() > This s refers to the one from line 204, when I assume it should be the one Done PS1, Line 616: shared_ptr<KsckTabletReplica> r > const shared_ptr<KsckTabletReplica>& r Done PS1, Line 624: std::shared_ptr<KsckTabletReplica> r > same Done Line 716: } else if (consensus_conflict) { > If we catch a replica in the middle of a config change then we can still se Yes, that's part of a core problem with these consensus checks: we can't retrieve the information from all servers simultaneously, and config changes can happen as a normal part of the tablet's operation. However, AFAICT consensus doesn't track any timestamps on its configs, so we'd have to introduce that mechanism just for these checks. Would it be better to, instead of surfacing all conflicts, have some known bad situations that we check for and alert on? We could additionally allow ALL info to be printed, so a knowledgable admin could use that to investigate misbehaving tablets that our heuristics don't cover. http://gerrit.cloudera.org:8080/#/c/6772/1/src/kudu/tools/ksck.h File src/kudu/tools/ksck.h: PS1, Line 100: peers > you should also std::move(peers) here Tidy bot caught that on patch set 2; it's done in set 3. PS1, Line 124: int64_t term_; : boost::optional<string> leader_uuid_; : vector<string> peers_; > it looks like these can all be const If you make them const it breaks the assignment at ksck.cc L663 since the copy-assign operator is implicitly deleted. I think the compiler with copy-elide there. http://gerrit.cloudera.org:8080/#/c/6772/1/src/kudu/tools/ksck_remote.cc File src/kudu/tools/ksck_remote.cc: PS1, Line 114: for (auto it = tablet_status_map_.begin(); it != tablet_status_map_.end(); it++) > Since you're not mutating tablet_status_map_, it's more idiomatic to write Done PS1, Line 120: InsertOrDi > It's not safe to trust that multiple remotes won't have the same UUID and t Good point. -- To view, visit http://gerrit.cloudera.org:8080/6772 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will Berkeley <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-HasComments: Yes
