[docs] Add admin workflow for recovering from disk failure

I didn't document how to rebalance tablets onto the repaired tserver if
necessary, since the process is complicated and error prone, and we hope
to have a rebalancing tool in the future. These docs will quickly become
outdated when KUDU-616 is fixed, but I think it's worth it to document
since we frequently receive questions on the topic.
Change-Id: I6541bffc5e9546c523df610fd8c025dd05e403bf
Reviewed-on: http://gerrit.cloudera.org:8080/6606
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <[email protected]>
Reviewed-by: Andrew Wong <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/87154f4a
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/87154f4a
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/87154f4a

Branch: refs/heads/master
Commit: 87154f4a39c77ab92d80f3effa58de3000921127
Parents: d917400
Author: Dan Burkert <[email protected]>
Authored: Mon Apr 10 17:46:36 2017 -0700
Committer: Dan Burkert <[email protected]>
Committed: Tue Apr 11 21:27:43 2017 +0000

----------------------------------------------------------------------
 docs/administration.adoc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/kudu/blob/87154f4a/docs/administration.adoc
----------------------------------------------------------------------
diff --git a/docs/administration.adoc b/docs/administration.adoc
index 7003160..813d097 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -585,3 +585,38 @@ be done with the following command:
 ----
 $ kudu cluster ksck --checksum_scan --tables IntegrationTestBigLinkedList master-01.example.com,master-02.example.com,master-03.example.com
 ----
+
+[[disk_failure_recovery]]
+=== Recovering from Disk Failure
+
+// TODO(dan): revise this once KUDU-616 is fixed.
+Kudu tablet servers are not resistant to disk failure. When a disk containing a
+data directory or the write-ahead log (WAL) dies, the entire tablet server must
+be rebuilt. Kudu will automatically re-replicate tablets on other servers after
+a tablet server fails, but manual intervention is needed in order to restore the
+failed tablet server to a running state.
+
+The first step to restoring a tablet server after a disk failure is to replace
+the failed disk, or remove the failed disk from the data-directory and/or WAL
+configuration. Next, the existing data directories and WAL directory must be
+removed. For example, if the tablet server is configured with
+`--fs_wal_dir=/data/0/kudu-tserver-wal` and
+`--fs_data_dirs=/data/1/kudu-tserver,/data/2/kudu-tserver`, the following
+commands will remove the existing data directories and WAL directory:
+
+[source,bash]
+----
+$ rm -rf /data/0/kudu-tserver-wal /data/1/kudu-tserver /data/2/kudu-tserver
+----
+
+After the WAL and data directories are removed, the tablet server process can be
+started. When Kudu is installed using system packages, `service` is typically
+used:
+
+[source,bash]
+----
+$ sudo service kudu-tserver start
+----
+
+Once the tablet server is running again, new tablet replicas will be created on
+it as necessary.
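
As a rough illustration of the directory-removal step described in the added section, the sketch below derives the list of directories to remove from the two flag values, so the paths can be reviewed before anything is deleted. The helper name `list_kudu_dirs` is hypothetical and not part of Kudu; the only assumption taken from the docs is that `--fs_data_dirs` is a comma-separated list, as in the example configuration. The script only prints paths and removes nothing.

```shell
#!/usr/bin/env bash
# Sketch only: list the directories that the recovery procedure would remove.
# list_kudu_dirs is a hypothetical helper, not part of Kudu.
# $1: value of --fs_wal_dir
# $2: value of --fs_data_dirs (comma-separated list of directories)
list_kudu_dirs() {
  # The WAL directory is a single path.
  printf '%s\n' "$1"
  # Split the comma-separated data directories into one path per line.
  printf '%s\n' "$2" | tr ',' '\n'
}

# Dry run with the example configuration from the docs; prints one path per line.
list_kudu_dirs /data/0/kudu-tserver-wal /data/1/kudu-tserver,/data/2/kudu-tserver
```

Once the printed paths have been checked against the tablet server's actual configuration, they can be removed with `rm -rf` as in the documented command.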
