Repository: kudu
Updated Branches:
  refs/heads/master 816bc6fd8 -> fd1ffd0fb
[docs] Add tip on dealing with planned TS downtime

Rendering available at
https://github.com/wdberkeley/kudu/blob/docfollowerunavailablesec/docs/administration.adoc.

Change-Id: I55a992a00f35945187e02c55594edc6e261a72c4
Reviewed-on: http://gerrit.cloudera.org:8080/11486
Reviewed-by: Andrew Wong <[email protected]>
Reviewed-by: Grant Henke <[email protected]>
Tested-by: Will Berkeley <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/3a033d82
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/3a033d82
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/3a033d82

Branch: refs/heads/master
Commit: 3a033d829cd6aab17995b68371e7e136c47cc9b8
Parents: 816bc6f
Author: Will Berkeley <[email protected]>
Authored: Thu Sep 20 12:23:41 2018 -0700
Committer: Will Berkeley <[email protected]>
Committed: Thu Sep 20 21:32:51 2018 +0000

----------------------------------------------------------------------
 docs/administration.adoc | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/kudu/blob/3a033d82/docs/administration.adoc
----------------------------------------------------------------------
diff --git a/docs/administration.adoc b/docs/administration.adoc
index 74de5a0..b176f58 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -1120,6 +1120,43 @@ a node onto another machine.
 
 . Start all Kudu processes in the cluster.
 
+[[minimizing_cluster_disruption_during_temporary_single_ts_downtime]]
+=== Minimizing cluster disruption during temporary planned downtime of a single tablet server
+
+If a single tablet server is brought down temporarily in a healthy cluster, all
+tablets will remain available and clients will function as normal, after
+potential short delays due to leader elections. However, if the downtime lasts
+for more than `--follower_unavailable_considered_failed_sec` (default 300)
+seconds, the tablet replicas on the down tablet server will be replaced by new
+replicas on available tablet servers. This will cause stress on the cluster
+as tablets re-replicate and, if the downtime lasts long enough, significant
+reduction in the number of replicas on the down tablet server. This may require
+the rebalancer to fix.
+
+To work around this, increase `--follower_unavailable_considered_failed_sec` on
+all tablet servers so the amount of time before re-replication will start is
+longer than the expected downtime of the tablet server, including the time it
+takes the tablet server to restart and bootstrap its tablet replicas. To do
+this, run the following command for each tablet server:
+
+[source,bash]
+----
+$ sudo -u kudu kudu tserver set_flag <tserver_address> follower_unavailable_considered_failed_sec <num_seconds>
+----
+
+where `<num_seconds>` is the number of seconds that will encompass the downtime.
+Once the downtime is finished, reset the flag to its original value.
+
+----
+$ sudo -u kudu kudu tserver set_flag <tserver_address> follower_unavailable_considered_failed_sec <original_value>
+----
+
+WARNING: Be sure to reset the value of `--follower_unavailable_considered_failed_sec`
+to its original value.
+
+NOTE: On Kudu versions prior to 1.8, the `--force` flag must be provided in the above
+commands.
+
 [[rebalancer_tool]]
 === Running the tablet rebalancing tool
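
The "run the following command for each tablet server" step in the patch above could be scripted roughly as follows. This is only a dry-run sketch: it prints the `set_flag` commands instead of executing them. The function name `print_set_flag_cmds`, the example addresses, and the value 1800 are illustrative assumptions, not part of the commit. Only the command form itself and the default of 300 come from the documentation text.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one set_flag command per tablet server.
# The function name and the example addresses below are hypothetical.

FLAG=follower_unavailable_considered_failed_sec

# Usage: print_set_flag_cmds <value_in_seconds> <tserver_address>...
print_set_flag_cmds() {
  local value=$1
  shift
  local addr
  for addr in "$@"; do
    # Same command form as in the documentation added by the patch.
    echo "sudo -u kudu kudu tserver set_flag ${addr} ${FLAG} ${value}"
  done
}

# Before the planned downtime: raise the threshold (1800 is an assumed value).
print_set_flag_cmds 1800 ts1.example.com:7050 ts2.example.com:7050

# After the downtime: print the commands that restore the default of 300.
print_set_flag_cmds 300 ts1.example.com:7050 ts2.example.com:7050
```

Printing rather than executing keeps the sketch safe to try, and the output can be reviewed before being piped to a shell. On Kudu versions prior to 1.8 the printed commands would also need the `--force` flag mentioned in the NOTE.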
