aurora git commit: Updating scheduler backup restore instructions.

maxim Fri, 04 Mar 2016 09:18:12 -0800

Repository: aurora
Updated Branches:
  refs/heads/master 3a3415e1c -> f6506d903



Updating scheduler backup restore instructions.

Bugs closed: AURORA-1605

Reviewed at https://reviews.apache.org/r/43622/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/f6506d90
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/f6506d90
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/f6506d90

Branch: refs/heads/master
Commit: f6506d90378c7fc1fde7ba7fc170393cdc94d8b6
Parents: 3a3415e
Author: Maxim Khutornenko <[email protected]>
Authored: Fri Mar 4 09:16:52 2016 -0800
Committer: Maxim Khutornenko <[email protected]>
Committed: Fri Mar 4 09:16:52 2016 -0800

----------------------------------------------------------------------
 docs/client-commands.md    |  2 +-
 docs/storage-config.md     | 47 +++++++++++++++++++++++++----------------
 docs/thrift-deprecation.md | 10 ++++++---
 3 files changed, 37 insertions(+), 22 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/f6506d90/docs/client-commands.md
----------------------------------------------------------------------
diff --git a/docs/client-commands.md b/docs/client-commands.md
index fe3ee56..156fe4c 100644
--- a/docs/client-commands.md
+++ b/docs/client-commands.md
@@ -69,7 +69,7 @@ communicates with a single (non-leader-elected) scheduler.  For example:
 ```javascript
 [{
   "name": "example",
-  "scheduler_uri": "localhost:55555",
+  "scheduler_uri": "http://localhost:55555",
 }]
 ```
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/f6506d90/docs/storage-config.md
----------------------------------------------------------------------
diff --git a/docs/storage-config.md b/docs/storage-config.md
index c838ea3..7c64841 100644
--- a/docs/storage-config.md
+++ b/docs/storage-config.md
@@ -61,12 +61,6 @@ Maximum number of backups to retain before deleting the oldest backup(s).
 
 ## Recovering from a scheduler backup
 
-- [Overview](#overview)
-- [Preparation](#preparation)
-- [Assess Mesos replicated log damage](#assess-mesos-replicated-log-damage)
-- [Restore from backup](#restore-from-backup)
-- [Cleanup](#cleanup)
-
 **Be sure to read the entire page before attempting to restore from a backup, as it may have
 unintended consequences.**
 
@@ -82,6 +76,9 @@ Usually, it is a bad idea to restore a backup that is not extremely recent (i.e.
 hours). This is because the scheduler will expect the cluster to look exactly as the backup does,
 so any tasks that have been rescheduled since the backup was taken will be killed.
 
+Instructions below have been verified in [Vagrant environment](vagrant.md) and with minor
+syntax/path changes should be applicable to any Aurora cluster.
+
 ### Preparation
 
 Follow these steps to prepare the cluster for restoring from a backup:
@@ -91,13 +88,25 @@ Follow these steps to prepare the cluster for restoring from a backup:
 * Consider blocking external traffic on a port defined in `-http_port` for all schedulers to
 prevent users from interacting with the scheduler during the restoration process. This will help
 troubleshooting by reducing the scheduler log noise and prevent users from making changes that will
-be erased after the backup snapshot is restored
+be erased after the backup snapshot is restored.
+
+* Configure `aurora_admin` access to run all commands listed in
+  [Restore from backup](#restore-from-backup) section locally on the leading scheduler:
+  * Make sure the [clusters.json](client-commands.md#cluster-configuration) file configured to
+    access scheduler directly. Set `scheduler_uri` setting and remove `zk`. Since leader can get
+    re-elected during the restore steps, consider doing it on all scheduler replicas.
+  * Depending on your particular security approach you will need to either turn off scheduler
+    authorization by removing scheduler `-http_authentication_mechanism` flag or make sure the
+    direct scheduler access is properly authorized. E.g.: in case of Kerberos you will need to make
+    a `/etc/hosts` file change to match your local IP to the scheduler URL configured in keytabs:
+
+        <local_ip> <scheduler_domain_in_keytabs>
 
 * Next steps are required to put scheduler into a partially disabled state where it would still be
 able to accept storage recovery requests but unable to schedule or change task states. This may be
 accomplished by updating the following scheduler configuration options:
   * Set `-mesos_master_address` to a non-existent zk address. This will prevent scheduler from
-    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:2181`
+    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:1111/mesos/master`
   * `-max_registration_delay` - set to sufficiently long interval to prevent registration timeout
     and as a result scheduler suicide. E.g: `-max_registration_delay=360mins`
   * Make sure `-reconciliation_initial_delay` option is set high enough (e.g.: `365days`) to
@@ -108,34 +117,36 @@ accomplished by updating the following scheduler configuration options:
 
 ### Cleanup and re-initialize Mesos replicated log
 
-Get rid of the corrupted files and re-initialize Mesos replicate log:
+Get rid of the corrupted files and re-initialize Mesos replicated log:
 
 * Stop schedulers
 * Delete all files under `-native_log_file_path` on all schedulers
-* Initialize Mesos replica's log file: `mesos-log initialize --path=<-native_log_file_path>`
-* Restart schedulers
+* Initialize Mesos replica's log file: `sudo mesos-log initialize --path=<-native_log_file_path>`
+* Start schedulers
 
 ### Restore from backup
 
 At this point the scheduler is ready to rehydrate from the backup:
 
 * Identify the leading scheduler by:
-  * running `aurora_admin get_scheduler <cluster>` - if scheduler is responsive
+  * examining the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at the scheduler
+    `/vars` endpoint. Leader will have 1. All other replicas - 0.
   * examining scheduler logs
   * or examining Zookeeper registration under the path defined by `-zk_endpoints`
     and `-serverset_path`
 
-* Locate the desired backup file, copy it to the leading scheduler and stage recovery by running
-the following command on a leader
-`aurora_admin scheduler_stage_recovery <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
+* Locate the desired backup file, copy it to the leading scheduler's `-backup_dir` folder and stage
+recovery by running the following command on a leader
+`aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
 
 * At this point, the recovery snapshot is staged and available for manual verification/modification
-via `aurora_admin scheduler_print_recovery_tasks` and `scheduler_delete_recovery_tasks` commands.
+via `aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect` and
+`scheduler_delete_recovery_tasks --bypass-leader-redirect` commands.
 See `aurora_admin help <command>` for usage details.
 
-* Commit recovery. This instructs the scheduler to overwrite the existing Mesosreplicated log with
+* Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
 the provided backup snapshot and initiate a mandatory failover
-`aurora_admin scheduler_commit_recovery <cluster>`
+`aurora_admin scheduler_commit_recovery --bypass-leader-redirect  <cluster>`
 
 ### Cleanup
 Undo any modification done during [Preparation](#preparation) sequence.

http://git-wip-us.apache.org/repos/asf/aurora/blob/f6506d90/docs/thrift-deprecation.md
----------------------------------------------------------------------
diff --git a/docs/thrift-deprecation.md b/docs/thrift-deprecation.md
index e1f1fbc..62a71bc 100644
--- a/docs/thrift-deprecation.md
+++ b/docs/thrift-deprecation.md
@@ -29,17 +29,21 @@ Change is applied in a way that does not break scheduler/client with this versio
 communicate with scheduler/client from vCurrent-1.
 * Do not remove or rename the old field
 * Add a new field as an eventual replacement of the old one and implement a dual read/write
-anywhere the old field is used
+anywhere the old field is used. If a thrift struct is mapped in the DB store make sure both columns
+are marked as `NOT NULL`
 * Check [storage.thrift](../api/src/main/thrift/org/apache/aurora/gen/storage.thrift) to see if the
 affected struct is stored in Aurora scheduler storage. If so, you most likely need to backfill
-existing data to ensure both fields are populated eagerly on startup
-See [StorageBackfill.java](../src/main/java/org/apache/aurora/scheduler/storage/StorageBackfill.java)
+existing data to ensure both fields are populated eagerly on startup. See
+[this patch](https://reviews.apache.org/r/43172) as a real-life example of thrift-struct
+backfilling. IMPORTANT: backfilling implementation needs to ensure both fields are populated. This
+is critical to enable graceful scheduler upgrade as well as rollback to the old version if needed.
 * Add a deprecation jira ticket into the vCurrent+1 release candidate
 * Add a TODO for the deprecated field mentioning the jira ticket
 
 ### vCurrent+1
 Finalize the change by removing the deprecated fields from the Thrift schema.
 * Drop any dual read/write routines added in the previous version
+* Remove thrift backfilling in scheduler
 * Remove the deprecated Thrift field
 
 ## Testing
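
Note on the leader-identification step added to docs/storage-config.md in this commit: the check of the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at `/vars` could be scripted roughly as below. This is a minimal sketch, not part of the commit. It assumes that `/vars` returns plain text with one `<name> <value>` pair per line, and the helper names (`is_leader`, `fetch_vars`, `find_leader`) and any scheduler URLs passed in are hypothetical.

```python
from urllib.request import urlopen

# Metric name taken from the updated docs/storage-config.md text.
LEADER_METRIC = "scheduler_lifecycle_LEADER_AWAITING_REGISTRATION"


def is_leader(vars_text, metric=LEADER_METRIC):
    """Return True if the metric is reported with value 1 in /vars output."""
    for line in vars_text.splitlines():
        # Assumed format: "<metric_name> <value>" on each line.
        name, _, value = line.strip().partition(" ")
        if name == metric:
            return value.strip() == "1"
    return False


def fetch_vars(scheduler_url, timeout=5.0):
    """Fetch the raw /vars text from one scheduler (URL supplied by the caller)."""
    with urlopen(scheduler_url.rstrip("/") + "/vars", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def find_leader(scheduler_urls):
    """Return the first URL whose /vars output marks it as leader, else None."""
    for url in scheduler_urls:
        if is_leader(fetch_vars(url)):
            return url
    return None
```

`find_leader` would be called with the direct HTTP addresses of all scheduler replicas. The parsing is kept in `is_leader` so it can be checked against captured `/vars` output without network access.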
