This is an automated email from the ASF dual-hosted git repository.

granthenke pushed a commit to branch branch-1.12.x
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 1566ae27fc6ec0143dc4b2b809e2971665bca203
Author: Andrew Wong <[email protected]>
AuthorDate: Fri May 15 16:31:07 2020 -0700

    docs: add docs to orchestrate a rolling restart

    Change-Id: I268928ccdf23863880349716b9e5a848a0e443bb
    Reviewed-on: http://gerrit.cloudera.org:8080/15930
    Tested-by: Kudu Jenkins
    Reviewed-by: Alexey Serbin <[email protected]>
    Reviewed-by: Grant Henke <[email protected]>
    (cherry picked from commit 161dec90a10aa96fcc1d2ad789743f6bb37e0d48)
    Reviewed-on: http://gerrit.cloudera.org:8080/15943
    Reviewed-by: Hao Hao <[email protected]>
---
 docs/administration.adoc | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/administration.adoc b/docs/administration.adoc
index 48f7d71..2c12ac7 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -1202,9 +1202,9 @@ the new directory.
 WARNING: All of the command line steps below should be executed as the Kudu
 UNIX user, typically `kudu`.
 
-. Establish a
+. Use `ksck` to ensure the cluster is healthy, and establish a
   <<minimizing_cluster_disruption_during_temporary_single_ts_downtime,maintenance
-  window>> and shut down the tablet server.
+  window>> to bring the tablet server offline.
 
 . Run the tool with the desired directory configuration flags. For example, if
   a cluster was set up with `--fs_wal_dir=/wals`, `--fs_metadata_dir=/meta`, and
@@ -1532,6 +1532,39 @@ to its original value.
 
 NOTE: On Kudu versions prior to 1.8, the `--force` flag must be provided in the
 above `set_flag` commands.
 
+[[rolling_restart]]
+=== Orchestrating a rolling restart with no downtime
+
+As of Kudu 1.12, tooling is available to restart a cluster with no downtime. To
+perform such a "rolling restart", perform the following sequence:
+
+. Restart the master(s) one-by-one. If there is only a single master, this may
+  cause brief interference with on-going workloads.
+. Starting with a single tablet server, put the tablet server into
+  <<minimizing_cluster_disruption_during_temporary_single_ts_downtime,maintenance
+  mode>> by using the `kudu tserver state enter_maintenance` tool.
+. Start quiescing the tablet server using the `kudu tserver quiesce start`
+  tool. This will signal to Kudu to stop hosting leaders on the specified
+  tablet server and to redirect new scan requests to other tablet servers.
+. Periodically run `kudu tserver quiesce start` with the
+  `--error_if_not_fully_quiesced` option, until it returns success, indicating
+  that all leaders have been moved away from the tablet server and all on-going
+  scans have completed.
+. Restart the tablet server.
+. Periodically run `ksck` until the cluster is reported to be healthy.
+. Exit maintenance mode on the tablet server by running `kudu tserver state
+  exit_maintenance`. This will allow new tablet replicas to be placed on the
+  tablet server.
+. Repeat these steps for all tablet servers in the cluster.
+
+NOTE: If running with <<rack_awareness,rack awareness>>, the above steps can be
+performed restarting multiple tablet servers within a single rack at the same
+time. Users should use `ksck` to ensure the location assignment policy is
+enforced while going through these steps, and that no more than a single
+location is restarted at the same time. At least three locations should be
+defined in the cluster to safely restart multiple tablet servers within one
+location.
+
 [[rebalancer_tool]]
 === Running the tablet rebalancing tool
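The per-tablet-server portion of the procedure added by this patch lends itself to scripting. Below is a rough sketch, not part of the committed docs: the master addresses, tablet server UUIDs and hostnames, RPC ports, and the `ssh`/`systemctl` restart command are all deployment-specific assumptions. Only the `kudu tserver state`, `kudu tserver quiesce`, and `kudu cluster ksck` invocations come from the documented procedure. To keep the sketch safe to run anywhere, `kudu` and `ssh` are shadowed by functions that merely print the commands; delete those two stubs to execute for real.

```shell
#!/bin/sh
# Hypothetical rolling-restart sketch for the procedure above.
# MASTERS, the TSERVERS list, ports, and the restart command are
# assumptions about the deployment, not Kudu defaults you can rely on.
set -eu

MASTERS="master-1:7051,master-2:7051,master-3:7051"   # assumption

# "<uuid>:<host>" pairs; real UUIDs can be listed with
# `kudu tserver list $MASTERS -columns=uuid,rpc-addresses`.
TSERVERS="uuid-1:tserver-1 uuid-2:tserver-2 uuid-3:tserver-3"

# Dry-run stubs: print each command instead of executing it.
# Remove these two functions to run against a real cluster.
kudu() { echo "+ kudu $*"; }
ssh()  { echo "+ ssh $*"; }

# retry_until: rerun a command every 5 seconds until it succeeds.
retry_until() {
  until "$@"; do sleep 5; done
}

for entry in $TSERVERS; do
  uuid=${entry%%:*}
  host=${entry#*:}

  # 1. Maintenance mode: the masters will not re-replicate this
  #    server's tablets while it is briefly down.
  kudu tserver state enter_maintenance "$MASTERS" "$uuid"

  # 2. Quiesce, then poll until no leaders or active scans remain.
  kudu tserver quiesce start "$host:7050"
  retry_until kudu tserver quiesce start "$host:7050" --error_if_not_fully_quiesced

  # 3. Restart the tablet server (deployment-specific command).
  ssh "$host" sudo systemctl restart kudu-tserver

  # 4. Wait for a healthy cluster, then allow new replicas again.
  retry_until kudu cluster ksck "$MASTERS"
  kudu tserver state exit_maintenance "$MASTERS" "$uuid"
done
```

Per the rack-awareness note in the patch, one instance of this loop could be run per location, as long as `ksck` is used to confirm that no more than a single location is restarting at any time.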
