[tools] Don't count tombstoned replicas in `kudu remote_replica check` Counting them as not running made the tool pretty useless on any realistic cluster, where it's expected to have some tombstones. For example, it wasn't possible to use this tool to tell when all the tablet replicas on a tablet server had finished bootstrapping.
I tested this against a server that previously failed the check because it had about 6000/10000 replicas tombstoned in STOPPED and INITIALIZED states. It now passes the check. Change-Id: Ibbdde768cce5f1d27064b7276944a585b41a2f03 Reviewed-on: http://gerrit.cloudera.org:8080/11962 Reviewed-by: Andrew Wong <[email protected]> Tested-by: Kudu Jenkins Reviewed-by: Attila Bukor <[email protected]> Project: http://git-wip-us.apache.org/repos/asf/kudu/repo Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/77c729d5 Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/77c729d5 Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/77c729d5 Branch: refs/heads/master Commit: 77c729d58360d56de7b2bdab9304b867faee02a8 Parents: b8e6498 Author: Will Berkeley <[email protected]> Authored: Tue Nov 20 00:37:00 2018 -0800 Committer: Will Berkeley <[email protected]> Committed: Tue Nov 20 10:12:53 2018 +0000 ---------------------------------------------------------------------- src/kudu/tablet/metadata.proto | 2 +- src/kudu/tools/tool_action_remote_replica.cc | 13 +++++++++---- 2 files changed, 10 insertions(+), 5 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/kudu/blob/77c729d5/src/kudu/tablet/metadata.proto ---------------------------------------------------------------------- diff --git a/src/kudu/tablet/metadata.proto b/src/kudu/tablet/metadata.proto index 0c3110d..8905249 100644 --- a/src/kudu/tablet/metadata.proto +++ b/src/kudu/tablet/metadata.proto @@ -165,7 +165,7 @@ enum TabletStatePB { // state Peers are available for client RPCs. RUNNING = 1; - // The tablet failed to for some reason. TabletReplica::error() will return + // The tablet failed for some reason. TabletReplica::error() will return // the reason for the failure. FAILED = 2; http://git-wip-us.apache.org/repos/asf/kudu/blob/77c729d5/src/kudu/tools/tool_action_remote_replica.cc ---------------------------------------------------------------------- diff --git a/src/kudu/tools/tool_action_remote_replica.cc b/src/kudu/tools/tool_action_remote_replica.cc index a62d695..9ffb195 100644 --- a/src/kudu/tools/tool_action_remote_replica.cc +++ b/src/kudu/tools/tool_action_remote_replica.cc @@ -200,7 +200,9 @@ Status CheckReplicas(const RunnerContext& context) { bool all_running = true; for (const auto& r : replicas) { const TabletStatusPB& rs = r.tablet_status(); - if (rs.state() != tablet::RUNNING) { + // It's ok if the tablet replica isn't running if it's tombstoned. + if (rs.state() != tablet::RUNNING && + rs.tablet_data_state() != tablet::TABLET_DATA_TOMBSTONED) { cerr << "Tablet id: " << rs.tablet_id() << " is " << tablet::TabletStatePB_Name(rs.state()) << endl; all_running = false; @@ -208,10 +210,10 @@ Status CheckReplicas(const RunnerContext& context) { } if (all_running) { - cout << "All tablets are running" << endl; + cout << "All tablet replicas are running" << endl; return Status::OK(); } - return Status::IllegalState("Not all tablets are running"); + return Status::IllegalState("Not all tablet replicas are running"); } Status DeleteReplica(const RunnerContext& context) { @@ -394,7 +396,10 @@ Status UnsafeChangeConfig(const RunnerContext& context) { unique_ptr<Mode> BuildRemoteReplicaMode() { unique_ptr<Action> check_replicas = ActionBuilder("check", &CheckReplicas) - .Description("Check if all tablet replicas on a Kudu Tablet Server are running") + .Description("Check if all tablet replicas on a Kudu tablet server are " + "running. Tombstoned replica do not count as not running, " + "because they are just records of the previous existence of " + "a replica.") .AddRequiredParameter({ kTServerAddressArg, kTServerAddressDesc }) .Build();
