[jira] [Comment Edited] (AURORA-1605) Update recovery docs to reflect changes

John Sirois (JIRA) Wed, 03 Feb 2016 10:01:36 -0800

    [ 
https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130751#comment-15130751
 ]


John Sirois edited comment on AURORA-1605 at 2/3/16 6:00 PM:
-------------------------------------------------------------

I went through the docs using test_kerberos_end_to_end.sh and hit a few 
roadblocks and things that do not jibe with the description in this ticket.  I 
may well be missing something obvious, but if not, my experience is detailed below.

h5. Setup environment to test recovery
# I edited test_kerberos_end_to_end.sh to skip tear-down and then ran it to 
set up the kerberized scheduler.
# I ssh'd to vagrant and manually ran the script's steps up through setup to get 
kinit'd as root for the aurora_admin commands I'd need to run, i.e. roughly:
{noformat}
cd ~/krb5-1.13.1/build
make testrealm
SCHEDULER_HOSTNAME=aurora.local
kadmin.local -q "addprinc -randkey HTTP/$SCHEDULER_HOSTNAME"
rm -f testdir/HTTP-$SCHEDULER_HOSTNAME.keytab
kadmin.local -q "ktadd -keytab testdir/HTTP-$SCHEDULER_HOSTNAME.keytab HTTP/$SCHEDULER_HOSTNAME"
kadmin.local -q "addprinc -randkey root"
rm -f testdir/root.keytab
kadmin.local -q "ktadd -keytab testdir/root.keytab root"
kinit -k -t "testdir/root.keytab" root
{noformat}
# I took a backup and listed the available backups: {{aurora_admin 
scheduler_backup_now devcluster && aurora_admin scheduler_list_backups 
devcluster}}

h5. Do a restore

I ran through the restore docs, with details below:

h6. Preparation

{noformat}
$ diff /etc/init/aurora-scheduler-kerberos.conf /etc/init/aurora-scheduler-kerberos.pre-recovery.conf
42,44c42
<   -mesos_master_address=zk://localhost:181/mesos/master \
<   -max_registration_delay=365days \
<   -reconciliation_initial_delay=365days \
---
>   -mesos_master_address=zk://localhost:2181/mesos/master \
{noformat}

h6. Restore from backup

The leading scheduler could only be identified via logs:
{noformat}
sudo grep "Elected as leading scheduler" /var/log/upstart/aurora-scheduler-kerberos.log | tail -1
    I0203 16:57:05.336 [main, SchedulerLifecycle$5:238] Elected as leading scheduler!
{noformat}
or by examining the ZooKeeper nodes:
{noformat}
/usr/share/zookeeper/bin/zkCli.sh ls /aurora/scheduler
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[singleton_candidate_0000000027]
/usr/share/zookeeper/bin/zkCli.sh get /aurora/scheduler/singleton_candidate_0000000027
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
127.0.1.1
cZxid = 0x17b
ctime = Wed Feb 03 17:12:43 UTC 2016
mZxid = 0x17b
mtime = Wed Feb 03 17:12:43 UTC 2016
pZxid = 0x17b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x152a7d138160090
dataLength = 9
numChildren = 0
{noformat}
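The ZooKeeper check could be scripted. This is only a rough sketch: it assumes the usual ZooKeeper election convention that the candidate with the lowest sequence number is the leader (I have not confirmed that Aurora follows it) and that the {{ls}} output contains the bracketed list shown above. {{lowest_candidate}} is a made-up helper name.

```shell
# Hypothetical helper: read zkCli.sh "ls" output on stdin and print the
# candidate node with the lowest sequence number.
# Assumption: the lowest sequence number identifies the leader.
lowest_candidate() {
  # Sequence numbers are zero-padded, so a plain lexicographic sort is enough.
  grep -o 'singleton_candidate_[0-9]*' | sort | head -n 1
}
```

It would be used as {{/usr/share/zookeeper/bin/zkCli.sh ls /aurora/scheduler | lowest_candidate}}, followed by a {{get}} on the printed node to see the host.
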
At this point, though, all aurora_admin commands (e.g. {{aurora_admin 
get_scheduler}}, {{aurora_admin scheduler_list_backups}}, etc.) fail in this way:
{noformat}
aurora_admin scheduler_stage_recovery -v --bypass-leader-redirect devcluster scheduler-backup-2016-02-03-16-32
DEBUG] Using auth module: <apache.aurora.kerberos.auth_module.KerberosAuthModule object at 0x2b628a0b6290>
 INFO] Connecting to 192.168.33.7:2181
 INFO] Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=10000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
 INFO] Zookeeper connection established, state: CONNECTED
 INFO] Sending request(xid=1): GetChildren(path=u'/aurora/scheduler', watcher=<function get_watch at 0x2b628a0d6488>)
 INFO] Received response(xid=1): [u'singleton_candidate_0000000022']
 INFO] Sending request(xid=2): GetChildren(path=u'/aurora/scheduler', watcher=None)
 INFO] Received response(xid=2): [u'singleton_candidate_0000000022']
 WARN] Could not connect to scheduler: No schedulers detected in devcluster!
{noformat}

As a result, the only way to complete the rest of the guide was to re-edit 
{{/etc/init/aurora-scheduler-kerberos.conf}} and restore the correct 
{{-mesos_master_address}}.  After doing this and bouncing the scheduler I could 
run aurora_admin commands and successfully complete the restore via the rest of 
the guide.

So it seems to me the guide needs to suggest, at a high level:
# Stop all schedulers (say 5 of them).
# Prepare all but one scheduler (4 in this example) as in "Preparation". 
Prepare the remaining scheduler as in "Preparation" too, but without the 
invalid {{-mesos_master_address}}, and with extra emphasis on the port-blocking 
step to prevent user activity.  This special scheduler will be used to run the 
recovery staging, review and commit.

If I have this approximately right, I concur with [~StephanErb]'s second 
comment above: the first "Identify the leading scheduler by" option, i.e. 
{{aurora_admin get_scheduler}}, will then always work, but it's beside the point 
since the preparation already singled out a leader to run the recovery against.

This leads me to think the purpose of the "Identify the leading scheduler by" 
section is to find the _last_ leading scheduler before recovery operations to 
then go to that machine and find the latest backup file.  That file is then 
copied over to the recovery leading scheduler.
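
If that reading is right, picking the latest backup on the old leader could look roughly like this. It is a sketch that assumes backup names follow the {{scheduler-backup-YYYY-MM-DD-HH-mm}} pattern seen in my run above, so the newest name sorts last; {{latest_backup}} is a made-up helper name.

```shell
# Hypothetical helper: read backup names on stdin (one per line) and print
# the newest one.
# Assumption: names end in a zero-padded timestamp, so a lexicographic sort
# orders them by time.
latest_backup() {
  grep '^scheduler-backup-' | sort | tail -n 1
}
```

It would be fed a listing of the backup directory on the old leader, and the printed file is the one to copy to the recovery scheduler.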



> Update recovery docs to reflect changes
> ---------------------------------------
>
>                 Key: AURORA-1605
>                 URL: https://issues.apache.org/jira/browse/AURORA-1605
>             Project: Aurora
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Joshua Cohen
>            Priority: Minor
>
> We had to restore one of our clusters from backup recently, and it turns out 
> there's been some drift between the [documented 
> process](https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#recovering-from-a-scheduler-backup)
>  and what's currently necessary.
> Specifically, we needed to disable the leader redirect filter and, I believe, 
> mesos authentication.
> We should make sure the recovery docs are up to date with what's actually 
> required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
