Repository: aurora
Updated Branches:
  refs/heads/master 5e5bcbcbd -> a6d9288ce


Updated restore instructions to reflect using offline rehydration tool.

Rewrote the instructions for recovering from backup based upon using Bill's 
tool to recover with all instances offline.

Reviewed at https://reviews.apache.org/r/67705/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/a6d9288c
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/a6d9288c
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/a6d9288c

Branch: refs/heads/master
Commit: a6d9288ce506b0f1761c376cd50f1ed7d2851d6c
Parents: 5e5bcbc
Author: Renan DelValle <[email protected]>
Authored: Fri Jun 29 15:36:03 2018 -0700
Committer: Renan DelValle <[email protected]>
Committed: Fri Jun 29 15:36:03 2018 -0700

----------------------------------------------------------------------
 docs/operations/backup-restore.md | 97 +++++++++++++++-------------------
 1 file changed, 43 insertions(+), 54 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/a6d9288c/docs/operations/backup-restore.md
----------------------------------------------------------------------
diff --git a/docs/operations/backup-restore.md b/docs/operations/backup-restore.md
index 15e6dd2..53bea5d 100644
--- a/docs/operations/backup-restore.md
+++ b/docs/operations/backup-restore.md
@@ -18,74 +18,63 @@ so any tasks that have been rescheduled since the backup was taken will be killed
 Instructions below have been verified in [Vagrant environment](../getting-started/vagrant.md) and with minor
 syntax/path changes should be applicable to any Aurora cluster.
 
-## Preparation
-
 Follow these steps to prepare the cluster for restoring from a backup:
 
-* Stop all scheduler instances
-
-* Consider blocking external traffic on a port defined in `-http_port` for all schedulers to
-prevent users from interacting with the scheduler during the restoration process. This will help
-troubleshooting by reducing the scheduler log noise and prevent users from making changes that will
-be erased after the backup snapshot is restored.
+## Preparation
 
-* Configure `aurora_admin` access to run all commands listed in
-  [Restore from backup](#restore-from-backup) section locally on the leading scheduler:
-  * Make sure the [clusters.json](../reference/client-cluster-configuration.md) file configured to
-    access scheduler directly. Set `scheduler_uri` setting and remove `zk`. Since leader can get
-    re-elected during the restore steps, consider doing it on all scheduler replicas.
-  * Depending on your particular security approach you will need to either turn off scheduler
-    authorization by removing scheduler `-http_authentication_mechanism` flag or make sure the
-    direct scheduler access is properly authorized. E.g.: in case of Kerberos you will need to make
-    a `/etc/hosts` file change to match your local IP to the scheduler URL configured in keytabs:
+* Stop all scheduler instances.
 
-        <local_ip> <scheduler_domain_in_keytabs>
+* Pick a backup to use for rehydrating the mesos-replicated log. Backups can be found in the
+directory given to the scheduler as the `-backup_dir` argument. Backups are stored in the format
+`scheduler-backup-<yyyy-MM-dd-HH-mm>`.
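
Because the `yyyy-MM-dd-HH-mm` suffix sorts chronologically as plain text, the newest backup can be picked with a plain `sort`. A minimal sketch (the temporary directory and file names here are stand-ins for the real `-backup_dir` contents):

```shell
# Demo with a temporary directory standing in for -backup_dir; on a real
# scheduler host, list the actual -backup_dir instead.
backup_dir=$(mktemp -d)
touch "$backup_dir/scheduler-backup-2018-06-28-10-00" \
      "$backup_dir/scheduler-backup-2018-06-29-15-36"
# yyyy-MM-dd-HH-mm timestamps sort chronologically as plain text,
# so the last name in sorted order is the most recent backup.
newest=$(ls "$backup_dir" | sort | tail -n 1)
echo "$newest"   # scheduler-backup-2018-06-29-15-36
```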
 
-* Next steps are required to put scheduler into a partially disabled state where it would still be
-able to accept storage recovery requests but unable to schedule or change task states. This may be
-accomplished by updating the following scheduler configuration options:
-  * Set `-mesos_master_address` to a non-existent zk address. This will prevent scheduler from
-    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:1111/mesos/master`
-  * `-max_registration_delay` - set to sufficiently long interval to prevent registration timeout
-    and as a result scheduler suicide. E.g: `-max_registration_delay=360mins`
-  * Make sure `-reconciliation_initial_delay` option is set high enough (e.g.: `365days`) to
-    prevent accidental task GC. This is important as scheduler will attempt to reconcile the cluster
-    state and will kill all tasks when restarted with an empty Mesos replicated log.
+* If running the Aurora Scheduler in HA mode, pick a single scheduler instance to rehydrate.
 
-* Restart all schedulers
+* Locate the `recovery-tool` in your setup. If Aurora was installed using a Debian package
+generated by our `aurora-packaging` script, the recovery tool can be found
+in `/usr/share/aurora/bin/recovery-tool`.
 
-## Cleanup and re-initialize Mesos replicated log
+## Cleanup
 
-Get rid of the corrupted files and re-initialize Mesos replicated log:
+* Delete (or move) the Mesos replicated log path for each scheduler instance. The location of the
+Mesos replicated log can be found by looking at the value given to the flag
+`-native_log_file_path` for each instance.
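
Moving rather than deleting keeps a fallback in case the restore has to be retried. A sketch, with a temporary directory standing in for the real `-native_log_file_path` value:

```shell
# Demo: move the replicated log aside instead of deleting it, keeping a
# timestamped copy in case the restore has to be retried.
# A temp dir stands in for the real -native_log_file_path value.
native_log_file_path="$(mktemp -d)/db"
mkdir -p "$native_log_file_path"
mv "$native_log_file_path" "${native_log_file_path}.corrupt.$(date +%Y-%m-%d-%H-%M)"
```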
 
-* Stop schedulers
-* Delete all files under `-native_log_file_path` on all schedulers
-* Initialize Mesos replica's log file: `sudo mesos-log initialize --path=<-native_log_file_path>`
-* Start schedulers
+* Initialize the Mesos replicated log files using the mesos-log tool:
+```
+sudo -u <USER> mesos-log initialize --path=<native_log_file_path>
+```
+Where `USER` is the user under which the scheduler instance will be run. For installations using
+Debian packages, the default user will be `aurora`. You may alternatively choose to specify
+a group as well by passing the `-g <GROUP>` option to `sudo`.
+Note that if the user under which the Aurora scheduler instance is run _does not_ have permissions
+to read this directory and the files it contains, the instance will fail to start.
 
 ## Restore from backup
 
-At this point the scheduler is ready to rehydrate from the backup:
+* Run the `recovery-tool`. Wherever the flags match those used for the scheduler instance,
+use the same values:
+```
+$ recovery-tool -from BACKUP \
+-to LOG \
+-backup=<selected_backup_location> \
+-native_log_zk_group_path=<native_log_zk_group_path> \
+-native_log_file_path=<native_log_file_path> \
+-zk_endpoints=<zk_endpoints>
+```
 
-* Identify the leading scheduler by:
-  * examining the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at the scheduler
-    `/vars` endpoint. Leader will have 1. All other replicas - 0.
-  * examining scheduler logs
-  * or examining Zookeeper registration under the path defined by `-zk_endpoints`
-    and `-serverset_path`
+## Bring scheduler instances back online
 
-* Locate the desired backup file, copy it to the leading scheduler's `-backup_dir` folder and stage
-recovery by running the following command on a leader
-`aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
+### If running in HA Mode
 
-* At this point, the recovery snapshot is staged and available for manual verification/modification
-via `aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect` and
-`scheduler_delete_recovery_tasks --bypass-leader-redirect` commands.
-See `aurora_admin help <command>` for usage details.
+* Start the rehydrated scheduler instance along with enough cleaned up instances to
+meet the `-native_log_quorum_size`. The mesos-replicated log algorithm will replenish
+the "blank" scheduler instances with the information from the rehydrated instance.
+
+* Start any remaining scheduler instances.
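
For reference, the replicated log needs a majority of replicas, so for `N` scheduler instances `-native_log_quorum_size` is conventionally `floor(N/2) + 1` (e.g. 2 of 3). A quick sketch of that arithmetic:

```shell
# Majority quorum for N replicas: floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }
quorum 3   # prints 2
quorum 5   # prints 3
```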
+
+### If running in singleton mode
+
+* Start the single scheduler instance.
 
-* Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
-the provided backup snapshot and initiate a mandatory failover
-`aurora_admin scheduler_commit_recovery --bypass-leader-redirect  <cluster>`
 
-## Cleanup
-Undo any modification done during [Preparation](#preparation) sequence.
