Jean-Daniel Cryans has submitted this change and it was merged.

Change subject: KUDU-1516 ksck should check for more raft-related status issues (partial)
......................................................................
KUDU-1516 ksck should check for more raft-related status issues (partial)

This patch improves ksck, mainly by adding "tablet server POV" information. ksck now gathers information about tablet replicas from the tablet servers and cross-references it with the master metadata.

This adds the following checks:
* each tablet has a majority of replicas on live tablet servers
* if a tablet has a majority of replicas on live tablet servers, then a majority of its replicas are in RUNNING state
* the assignment of tablets to tablet servers in the master agrees with the assignment of tablet replicas reported by the tablet servers

This patch does not include other desiderata from KUDU-1516, like a consensus canary or a write op canary.

The code is also restructured quite a bit, so that all of the "fetch information from tablet servers" work happens up front in a single call. This paves the way for a future enhancement in which all of these RPCs are done on a thread pool (since fetching can be somewhat slow for large clusters).

To try to improve performance for clusters with a lot of data, I also added a flag to the ListTablets RPC so that the response does not include schema information, which is both large and irrelevant for this use case.

An example of the new output against a cluster with some dead tablet servers and broken tablets is available at:
https://gist.github.com/toddlipcon/7ae677214988d064627bf1325f04dfac

This patch is based on some earlier work by Will Berkeley.
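The three checks listed in the commit message can be illustrated with a small sketch. This is not the ksck implementation (which is C++ and lives in src/kudu/tools/ksck.cc); the `TabletServerReport` class, the `check_tablet` function, and the problem strings are all hypothetical names chosen for illustration. Only the RUNNING state name and the idea of cross-referencing master metadata with tablet server reports come from the patch description.

```python
from dataclasses import dataclass, field


@dataclass
class TabletServerReport:
    """Hypothetical shape of what one tablet server reports about itself."""
    uuid: str
    live: bool
    # tablet_id -> replica state string reported by the server, e.g. "RUNNING"
    replica_states: dict = field(default_factory=dict)


def check_tablet(tablet_id, master_replica_uuids, reports):
    """Return a list of problems for one tablet; an empty list means healthy.

    master_replica_uuids: tablet server UUIDs the master lists for this tablet.
    reports: dict of tablet server UUID -> TabletServerReport.
    """
    problems = []
    majority = len(master_replica_uuids) // 2 + 1

    # Check 1: a majority of the replicas must sit on live tablet servers.
    live = [u for u in master_replica_uuids
            if u in reports and reports[u].live]
    if len(live) < majority:
        problems.append("no majority of replicas on live tablet servers")
    else:
        # Check 2: of those, a majority must report the RUNNING state.
        running = [u for u in live
                   if reports[u].replica_states.get(tablet_id) == "RUNNING"]
        if len(running) < majority:
            problems.append("no majority of replicas in RUNNING state")

    # Check 3: the master's assignment must agree with the server reports.
    # Dead servers cannot be asked, so only live ones are compared.
    for u in live:
        if tablet_id not in reports[u].replica_states:
            problems.append(f"master lists a replica on {u}, "
                            "but that server does not report it")
    for u, report in reports.items():
        if (report.live and u not in master_replica_uuids
                and tablet_id in report.replica_states):
            problems.append(f"{u} reports a replica the master does not list")

    return problems
```

In this sketch all tablet server reports are collected before any check runs, which mirrors the "fetch information from tablet servers up front" restructuring the patch describes. A caller would build the `reports` dict once and then call `check_tablet` for every tablet the master knows about.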
Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Reviewed-on: http://gerrit.cloudera.org:8080/3632
Tested-by: Kudu Jenkins
Reviewed-by: Jean-Daniel Cryans <[email protected]>
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/master/master.proto
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck-test.cc
M src/kudu/tools/ksck.cc
M src/kudu/tools/ksck.h
M src/kudu/tools/ksck_remote-test.cc
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/ksck_remote.h
M src/kudu/tools/kudu-ksck.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tserver.proto
12 files changed, 402 insertions(+), 130 deletions(-)

Approvals:
  Jean-Daniel Cryans: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/3632
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
