Will Berkeley has posted comments on this change. Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but still in config ......................................................................
Patch Set 7: (54 comments) http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 471: // The ids of the tablets. > Can we add: An empty list means return info for all tablets known to the ta Done PS7, Line 557: // Returns the consensus state for a tablet. : rpc GetConsensusState(GetConsensusStateRequestPB) returns (GetConsensusStateResponsePB); > Can we delete this function now? Gotta love deleting code. I don't think th Done Line 560: // Returns the consensus state for a set of tablets. > I think we should document whether this returns info for tombstoned tablets Done PS7, Line 561: GetConsensusStates > I think GetConsensusState() is still a valid name for this endpoint, or if I like GetConsensusState. http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck-test.cc File src/kudu/tools/ksck-test.cc: Line 70: const std::string& tablet_id, > warning: parameter 'tablet_id' is unused [misc-unused-parameters] Done Line 71: const Schema& schema, > warning: parameter 'schema' is unused [misc-unused-parameters] Done PS7, Line 232: > nit: spacing Done PS7, Line 236: InsertOrDie > Do we also need to worry about the duplicate UUID issue here? No. This is test code that's manually setting up a vanilla consensus configuration on each mock TS. If you wanted to test a case like that, you'd reach in and set it up in the specific test, as I do in the tests below for various situations Line 243: string tablet_id, > warning: the parameter 'tablet_id' is copied for each invocation but only u Done Line 243: string tablet_id, > I think this is complaining because we should either specify the input argu Kept it simple for the test and went with const string&. Line 394: FLAGS_check_consensus_info = false; > No need to reset your flags between tests due to KuduTest::flag_saver_ :) PS7, Line 401: shared_ptr<KsckTabletServer> ts > is it safe to declare this const shared_ptr<KsckTabletServer>& ts ? Done PS7, Line 401: "ts-id-0") > can we make "ts-id-0" and other identifier strings in here (except perhaps I don't think making constants for the whole file or for KsckTest makes sense just to cover these three specific tests, and the literals are actually used just once per test, so I don't think it makes it clearer or easier to store them in a variable. FWIW, it's also common to use a literal "ts-id-0" or similar in other tests in this file. Line 413: FLAGS_check_consensus_info = false; > also here Done Line 432: FLAGS_check_consensus_info = false; > and here Done Line 451: FLAGS_check_consensus_info = false; > and here Done Line 453: > Would also be interesting to test the duplicate TS UUID case. You mean the rare + bad case from L123 of ksck_remote.cc? That code path can't be tested here as that situation is dealt with by the RemoteKsckTabletServer, not by Ksck. What would happen, though, is that one of the duplicates' (same ts_id and tablet id) consensus info would be left out of the info provided to Ksck. I think it'd complicate the code to cover that rare edge case nicely, but it is easy to spit out a warning if we run into a duplicate. I'll do that. http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: Line 63: using std::left; > warning: using decl 'left' is unused [misc-unused-using-decls] Done PS7, Line 623: peers_from_master > nit: peer_uuids_from_master might be clearer since when I read peer I think Done PS7, Line 624: std::transform(tablet->replicas().cbegin(), tablet->replicas().cend(), : peers_from_master.begin(), : [](const std::shared_ptr<KsckTabletReplica>& r) { return r->ts_uuid(); }); > nit: This could be more simply written as a for-each loop: Done PS7, Line 627: boost::none > Doesn't the master know the OpId index? In its cmeta for the tablet I would guess so but that's not actually info that ksck gathers presently. KsckRemoteMaster is sneaky and gets its info about the master's POV on tablets by requesting scan tokens for a scan it'll never do. We'd have to refactor a fair bit more to add in the term/opid. PS7, Line 627: -1 > nit: magic number. Also, doesn't the master know the term? Magic number removed by using an enum. See above for why the master doesn't have the term here. PS7, Line 649: strings::Substitute("$0:$1", ts->uuid(), tablet->id()) > as noted elsewhere this seems hacky compared to: Done PS7, Line 658: peers_from_replica > same as above on naming Done PS7, Line 659: std::transform(peers.begin(), peers.end(), peers_from_replica.begin(), : [](const consensus::RaftPeerPB& pb) { return pb.permanent_uuid(); }); > nit: same as above regarding functional vs. iterative but I guess it's also Done PS7, Line 693: conflicting_states = std::count_if(replica_states.cbegin(), replica_states.cend(), : [&](const ReplicaState& r) -> bool { return r.consensus_state != master_config; }); > nit: this is quite wordy compared to: Done Line 728: } else if (conflicting_states > 0) { > It seems quite easy to achieve this on a large cluster, which may be moving I think that ksck "cries wolf" whenever there's nominal tablet movement going on in a cluster-- under-replication and consensus conflicts. Maybe later we can do a more thorough job of separating out meaningful warnings; for now, users can re-run ksck to see if the problem persists (targeting a specific table or tablet if needed). PS7, Line 778: all_peers > nit: peer_uuid_mapping? Done PS7, Line 795: // A lambda to print out consensus from the point of view of the master and each replica. : auto print_config = [&all_peers](const string& name, : const boost::optional<KsckConsensusState>& config) > Perhaps this lambda can be extracted into a function? Ksck::VerifyTablet() Done Line 815: if (config->term() >= 0) { > Seems like term should be an optional Done PS7, Line 817: config->opid_index().get() >= 0 > Do we ever expect a negative number here? I don't think that's possible. Done PS7, Line 820: Out() << Substitute(" $0.$1", config->term(), opid_str); > I don't like this layout because this looks like this was the last log entr Done PS7, Line 834: "master" > perhaps "config from master" to make this easier to understand? Done http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck.h File src/kudu/tools/ksck.h: PS7, Line 97: bool pending > can we make this an enum? Good idea. I think if we have three types: MASTER, COMMITTED, and PENDING, it should make the class easier to understand. Line 109: bool pending() const { > Needs API docs on the methods in this class, just a one liner like: N/A. Removed method PS7, Line 117: boost::optional<int64_t> > const boost::optional<int64_t>& ? N/A. Removed method PS7, Line 121: const boost::optional<std::string> > const boost::optional<std::string>& ? N/A. Removed method Line 129: bool operator==(const KsckConsensusState &other) const { > Would be useful to doc this equality operator Done Line 130: bool same_except_term = leader_uuid_ == other.leader_uuid_ && peers_ == other.peers_; > why not check pending_? pending_ removed in favor of enum type; type is now checked Line 131: if (term_ >= 0 && other.term_ >= 0) { > why not just term_ == other.term_ ? N/A with addition of MASTER type Line 137: private: > It looks like we can make all of these member variables const because they' I made it struct since that seems simpler. I want the copy assign operator because of ksck's logic of building a replica state one bit at a time, so I didn't make the fields const. Line 139: int64_t term_; > Since we are using a magic number -1 somewhere in ksck.cc maybe we should m Done PS7, Line 142: peers_ > Not sure what this is. How about peer_uuids_ ? Done PS7, Line 142: std::set > I guess using a std::set here saves us a sort if we want to print these pee Yup. That's the idea. PS7, Line 238: std::string > Hmm, I think it would be a little cleaner if we keyed this map by std::pair Done PS7, Line 247: FetchConsensusInfo > nit: Maybe FetchConsensusState() would be more clear Done PS7, Line 444: CONFLICTED > nit: maybe CONSENSUS_MISMATCH? Also add a comment to document its meaning. Done http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck_remote.cc File src/kudu/tools/ksck_remote.cc: Line 116: for (const auto& entry : tablet_status_map_) { > Is there a reason we don't just ask for all of them? No reason. My bad-- ksck lets you filter which tables to look at, but that is done at a higher level than this, so we will always gather all the info here. PS7, Line 122: Don't crash > Seems like we should warn in this case? Or perhaps detect this and warn els Came to the same conclusion while looking at another one of your comments before I saw this one. Done. http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck_remote.h File src/kudu/tools/ksck_remote.h: PS7, Line 58: FetchConsensusInfo > nit: Maybe FetchConsensusState()? Done. I also renamed the flag from check_consensus_info to check_consensus_state. http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/tool_action_cluster.cc File src/kudu/tools/tool_action_cluster.cc: PS7, Line 138: check_consensus_info > how about just --consensus ? Done http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS7, Line 1067: req->tablet_ids() > Is this is empty we can get them all via tablet_manager_->GetTabletReplicas Done Line 1069: if (!tablet_manager_->GetTabletReplica(tablet_id, &tablet_replica).ok()) { > As mentioned elsewhere, this excludes tombstoned tablets. I actually think I agree we should leave it out and perhaps add something in later to get it later. PS7, Line 1079: ConsensusStatePB state = consensus->ConsensusState(); : tablet_info->mutable_cstate()->Swap(&state); > do this in one line like: Done -- To view, visit http://gerrit.cloudera.org:8080/6772 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will Berkeley <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-HasComments: Yes
