Repository: kudu
Updated Branches:
  refs/heads/master bd24f04fb -> 9d03677e4
Add ksck section to admin guide common workflows

I've often wanted this when helping people through ksck.

Change-Id: I9631337b113d2c67be0057f728c68f792e8a4fd6
Reviewed-on: http://gerrit.cloudera.org:8080/6598
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins

Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/9d03677e
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/9d03677e
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/9d03677e

Branch: refs/heads/master
Commit: 9d03677e45dfa5722d816645200071e4d78fb845
Parents: bd24f04
Author: Dan Burkert <[email protected]>
Authored: Fri Apr 7 17:15:25 2017 -0700
Committer: Dan Burkert <[email protected]>
Committed: Mon Apr 10 19:48:00 2017 +0000

----------------------------------------------------------------------
 docs/administration.adoc | 74 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/kudu/blob/9d03677e/docs/administration.adoc
----------------------------------------------------------------------
diff --git a/docs/administration.adoc b/docs/administration.adoc
index d532561..7003160 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -367,8 +367,8 @@ are working properly, consider performing the following sanity checks:
   be listed there with one master in the LEADER role and the others in the FOLLOWER role. The
   contents of /masters on each master should be the same.

-* Run a Kudu system check (ksck) on the cluster using the `kudu` command line tool. Help for ksck
-  can be viewed via `kudu cluster ksck --help`.
+* Run a Kudu system check (ksck) on the cluster using the `kudu` command line
+  tool. See <<ksck>> for more details.

 === Recovering from a dead Kudu Master in a Multi-Master Deployment

@@ -517,5 +517,71 @@ consider performing the following sanity checks:
   be listed there with one master in the LEADER role and the others in the FOLLOWER role. The
   contents of /masters on each master should be the same.

-* Run a Kudu system check (ksck) on the cluster using the `kudu` command line tool. Help for ksck
-  can be viewed via `kudu cluster ksck --help`.
+* Run a Kudu system check (ksck) on the cluster using the `kudu` command line
+  tool. See <<ksck>> for more details.
+
+[[ksck]]
+=== Checking Cluster Health with `ksck`
+
+The `kudu` CLI includes a tool named `ksck` which can be used for checking
+cluster health and data integrity. `ksck` will identify issues such as
+under-replicated tablets, unreachable tablet servers, or tablets without a
+leader.
+
+`ksck` should be run from the command line, and requires the full list of master
+addresses to be specified:
+
+[source,bash]
+----
+$ kudu cluster ksck master-01.example.com,master-02.example.com,master-03.example.com
+----
+
+To see a full list of the options available with `ksck`, use the `--help` flag.
+If the cluster is healthy, `ksck` will print a success message, and return a
+zero (success) exit status.
+
+----
+Connected to the Master
+Fetched info from all 1 Tablet Servers
+Table IntegrationTestBigLinkedList is HEALTHY (1 tablet(s) checked)
+
+The metadata for 1 table(s) is HEALTHY
+OK
+----
+
+If the cluster is unhealthy, for instance if a tablet server process has
+stopped, `ksck` will report the issue(s) and return a non-zero exit status:
+
+----
+Connected to the Master
+WARNING: Unable to connect to Tablet Server 8a0b66a756014def82760a09946d1fce
+(tserver-01.example.com:7050): Network error: could not send Ping RPC to server: Client connection negotiation failed: client connection to 192.168.0.2:7050: connect: Connection refused (error 61)
+WARNING: Fetched info from 0 Tablet Servers, 1 weren't reachable
+Tablet ce3c2d27010d4253949a989b9d9bf43c of table 'IntegrationTestBigLinkedList'
+is unavailable: 1 replica(s) not RUNNING
+ 8a0b66a756014def82760a09946d1fce (tserver-01.example.com:7050): TS unavailable [LEADER]
+
+ Table IntegrationTestBigLinkedList has 1 unavailable tablet(s)
+
+ WARNING: 1 out of 1 table(s) are not in a healthy state
+ ==================
+ Errors:
+ ==================
+ error fetching info from tablet servers: Network error: Not all Tablet Servers are reachable
+ table consistency check error: Corruption: 1 table(s) are bad
+
+ FAILED
+ Runtime error: ksck discovered errors
+----
+
+To verify data integrity, the optional `--checksum_scan` flag can be set, which
+will ensure the cluster has consistent data by scanning each tablet replica and
+comparing results. The `--tables` or `--tablets` flags can be used to limit the
+scope of the checksum scan to specific tables or tablets, respectively. For
+example, checking data integrity on the `IntegrationTestBigLinkedList` table can
+be done with the following command:
+
+[source,bash]
+----
+$ kudu cluster ksck --checksum_scan --tables IntegrationTestBigLinkedList master-01.example.com,master-02.example.com,master-03.example.com
+----
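
Since the new section states that `ksck` returns a zero exit status for a healthy cluster and a non-zero status otherwise, the check lends itself to scripting. The following is a minimal sketch, not part of this commit: the function name `check_cluster` and its messages are hypothetical, and it relies only on the `kudu cluster ksck <master addresses>` invocation and the exit-status behavior described in the patch.

```shell
#!/bin/sh
# Sketch only: check_cluster is a hypothetical helper, not part of Kudu.
# It assumes the behavior documented above: ksck exits 0 when the cluster
# is healthy and non-zero when it finds problems.

check_cluster() {
    masters="$1"    # comma-separated list of master addresses

    # Capture the ksck report so it can be shown only when something is wrong.
    if report="$(kudu cluster ksck "$masters" 2>&1)"; then
        echo "cluster healthy"
    else
        status=$?   # exit status of the failed ksck run
        echo "cluster unhealthy (ksck exit status $status)"
        # Send the full ksck report to stderr for inspection.
        echo "$report" >&2
        return 1
    fi
}
```

A caller might run `check_cluster master-01.example.com,master-02.example.com,master-03.example.com` and branch on the function's return value, for instance from a cron job or a monitoring hook.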
