[2/2] kudu git commit: [docs] Add admin workflow for recovering from disk failure

danburkert Tue, 11 Apr 2017 14:28:24 -0700

[docs] Add admin workflow for recovering from disk failure

I didn't document how to rebalance tablets onto the repaired tserver if
necessary, since the process is complicated and error-prone, and we hope
to have a rebalancing tool in the future. These docs will quickly become
outdated when KUDU-616 is fixed, but I think it's worth it to document
since we frequently receive questions on the topic.


Change-Id: I6541bffc5e9546c523df610fd8c025dd05e403bf
Reviewed-on: http://gerrit.cloudera.org:8080/6606
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <[email protected]>
Reviewed-by: Andrew Wong <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/87154f4a
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/87154f4a
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/87154f4a

Branch: refs/heads/master
Commit: 87154f4a39c77ab92d80f3effa58de3000921127
Parents: d917400
Author: Dan Burkert <[email protected]>
Authored: Mon Apr 10 17:46:36 2017 -0700
Committer: Dan Burkert <[email protected]>
Committed: Tue Apr 11 21:27:43 2017 +0000

----------------------------------------------------------------------
 docs/administration.adoc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/87154f4a/docs/administration.adoc
----------------------------------------------------------------------
diff --git a/docs/administration.adoc b/docs/administration.adoc
index 7003160..813d097 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -585,3 +585,38 @@ be done with the following command:
 ----
 $ kudu cluster ksck --checksum_scan --tables IntegrationTestBigLinkedList master-01.example.com,master-02.example.com,master-03.example.com
 ----
+
+[[disk_failure_recovery]]
+=== Recovering from Disk Failure
+
+// TODO(dan): revise this once KUDU-616 is fixed.
+Kudu tablet servers are not resistant to disk failure. When a disk containing a
+data directory or the write-ahead log (WAL) dies, the entire tablet server must
+be rebuilt. Kudu will automatically re-replicate tablets on other servers after
+a tablet server fails, but manual intervention is needed in order to restore the
+failed tablet server to a running state.
+
+The first step to restoring a tablet server after a disk failure is to replace
+the failed disk, or remove the failed disk from the data-directory and/or WAL
+configuration. Next, the existing data directories and WAL directory must be
+removed. For example, if the tablet server is configured with
+`--fs_wal_dir=/data/0/kudu-tserver-wal` and
+`--fs_data_dirs=/data/1/kudu-tserver,/data/2/kudu-tserver`, the following
+commands will remove the existing data directories and WAL directory:
+
+[source,bash]
+----
+$ rm -rf /data/0/kudu-tserver-wal /data/1/kudu-tserver /data/2/kudu-tserver
+----
+
+After the WAL and data directories are removed, the tablet server process can be
+started. When Kudu is installed using system packages, `service` is typically
+used:
+
+[source,bash]
+----
+$ sudo service kudu-tserver start
+----
+
+Once the tablet server is running again, new tablet replicas will be created on
+it as necessary.
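
As an aside to the diff above, the directory-removal step could be wrapped in a small helper that defaults to a dry run, since `rm -rf` on the wrong path is hard to undo. This is only a sketch and not part of the commit: the function name `wipe_tserver_dirs` and the `DRY_RUN` variable are hypothetical, and it assumes the tablet server process is already stopped and the failed disk has been replaced or removed from the configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Kudu: remove the WAL and data
# directories passed as arguments. DRY_RUN is an assumed convention;
# with DRY_RUN=1 (the default) nothing is deleted and each target is
# only printed.
wipe_tserver_dirs() {
  local dir
  for dir in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would remove: $dir"
    else
      # "--" stops option parsing in case a path starts with "-".
      rm -rf -- "$dir"
    fi
  done
}
```

With the example configuration from the docs, one might first run `wipe_tserver_dirs /data/0/kudu-tserver-wal /data/1/kudu-tserver /data/2/kudu-tserver` to review the targets, repeat it with `DRY_RUN=0` to delete them, and then start the server with `sudo service kudu-tserver start`. Running `kudu cluster ksck` against the master addresses afterwards is one way to check cluster health once the server is back.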
