[
https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130751#comment-15130751
]
John Sirois edited comment on AURORA-1605 at 2/3/16 6:00 PM:
-------------------------------------------------------------
I went through the docs using test_kerberos_end_to_end.sh and hit a few
roadblocks / things that do not jibe with the description in this ticket. I'm
sure I'm missing obvious things, but if not, my experience is detailed below.
h5. Setup environment to test recovery
# I edited test_kerberos_end_to_end.sh to skip tear-down and then ran it to
set up the kerberized scheduler
# I ssh'd to vagrant and ran the steps through setup manually to get kinit'd as
root for the aurora_admin commands I'd need to run, i.e. roughly:
{noformat}
cd ~/krb5-1.13.1/build
make testrealm
SCHEDULER_HOSTNAME=aurora.local
kadmin.local -q "addprinc -randkey HTTP/$SCHEDULER_HOSTNAME"
rm -f testdir/HTTP-$SCHEDULER_HOSTNAME.keytab
kadmin.local -q "ktadd -keytab testdir/HTTP-$SCHEDULER_HOSTNAME.keytab HTTP/$SCHEDULER_HOSTNAME"
kadmin.local -q "addprinc -randkey root"
rm -f testdir/root.keytab
kadmin.local -q "ktadd -keytab testdir/root.keytab root"
kinit -k -t "testdir/root.keytab" root
{noformat}
# I then took a backup and listed the available backups:
{{aurora_admin scheduler_backup_now devcluster && aurora_admin scheduler_list_backups devcluster}}
h5. Do a restore
I ran through the restore docs, with details below:
h6. Preparation
{noformat}
$ diff /etc/init/aurora-scheduler-kerberos.conf /etc/init/aurora-scheduler-kerberos.pre-recovery.conf
42,44c42
< -mesos_master_address=zk://localhost:181/mesos/master \
< -max_registration_delay=365days \
< -reconciliation_initial_delay=365days \
---
> -mesos_master_address=zk://localhost:2181/mesos/master \
{noformat}
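A quick sanity check of the prepared config can be sketched in shell. This is only an illustration: {{has_recovery_flags}} is a hypothetical helper name, and it checks nothing beyond the presence of the two delay flags shown in the diff above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aurora): succeeds only if the given
# scheduler config file contains both delay flags from the diff above.
has_recovery_flags() {
  conf=$1
  # -e keeps grep from treating the leading dash of the pattern as an option.
  grep -q -e '-max_registration_delay=' "$conf" &&
    grep -q -e '-reconciliation_initial_delay=' "$conf"
}

# Usage (path taken from the diff above):
#   has_recovery_flags /etc/init/aurora-scheduler-kerberos.conf && echo "prepared"
```

The check is deliberately loose; it does not validate the flag values or the {{-mesos_master_address}} setting.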
h6. Restore from backup
The leading scheduler could only be identified via logs:
{noformat}
sudo grep "Elected as leading scheduler" /var/log/upstart/aurora-scheduler-kerberos.log | tail -1
I0203 16:57:05.336 [main, SchedulerLifecycle$5:238] Elected as leading scheduler!
{noformat}
or by examining the ZooKeeper nodes:
{noformat}
/usr/share/zookeeper/bin/zkCli.sh ls /aurora/scheduler
...
Connecting to localhost:2181
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[singleton_candidate_0000000027]
/usr/share/zookeeper/bin/zkCli.sh get /aurora/scheduler/singleton_candidate_0000000027
...
Connecting to localhost:2181
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
127.0.1.1
cZxid = 0x17b
ctime = Wed Feb 03 17:12:43 UTC 2016
mZxid = 0x17b
mtime = Wed Feb 03 17:12:43 UTC 2016
pZxid = 0x17b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x152a7d138160090
dataLength = 9
numChildren = 0
{noformat}
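The listings above show a single candidate. With several schedulers there would be several {{singleton_candidate_*}} children. Assuming the usual ZooKeeper election convention that the lowest sequence number holds leadership (an assumption, not something verified in this run), the node to inspect could be picked with a sketch like this. {{lowest_candidate}} is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aurora or ZooKeeper): reads the output of
# `zkCli.sh ls /aurora/scheduler` on stdin and prints the candidate node with
# the lowest sequence number.
# Assumption: the lowest sequence number marks the current leader.
lowest_candidate() {
  # Sequence numbers are zero-padded, so a byte-wise sort orders them numerically.
  grep -o 'singleton_candidate_[0-9][0-9]*' | LC_ALL=C sort | head -n 1
}

# Example with canned input instead of a live ZooKeeper:
printf '%s\n' '[singleton_candidate_0000000031, singleton_candidate_0000000027]' | lowest_candidate
# prints singleton_candidate_0000000027
```

Against a live cluster the input would come from the {{zkCli.sh ls /aurora/scheduler}} command shown above.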
At this point, though, all aurora_admin commands ({{aurora_admin get_scheduler}},
{{aurora_admin scheduler_list_backups}}, etc.) fail in this fashion:
{noformat}
aurora_admin scheduler_stage_recovery -v --bypass-leader-redirect devcluster scheduler-backup-2016-02-03-16-32
DEBUG] Using auth module: <apache.aurora.kerberos.auth_module.KerberosAuthModule object at 0x2b628a0b6290>
INFO] Connecting to 192.168.33.7:2181
INFO] Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=10000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
INFO] Zookeeper connection established, state: CONNECTED
INFO] Sending request(xid=1): GetChildren(path=u'/aurora/scheduler', watcher=<function get_watch at 0x2b628a0d6488>)
INFO] Received response(xid=1): [u'singleton_candidate_0000000022']
INFO] Sending request(xid=2): GetChildren(path=u'/aurora/scheduler', watcher=None)
INFO] Received response(xid=2): [u'singleton_candidate_0000000022']
WARN] Could not connect to scheduler: No schedulers detected in devcluster!
{noformat}
As a result, the only way to complete the rest of the guide was to re-edit
{{/etc/init/aurora-scheduler-kerberos.conf}} and restore the correct
{{-mesos_master_address}}. After doing this and bouncing the scheduler I could
run aurora_admin commands and successfully complete the restore via the rest of
the guide.
So it seems to me the guide needs to suggest, at a high level:
# All schedulers are stopped (say 5 of them).
# All but one scheduler (4 in this example) are prepared as in "Preparation".
The remaining scheduler is also prepared as in "Preparation", except for the
bit about setting an invalid {{-mesos_master_address}}, and with added emphasis
on the bit about port-blocking to prevent user activity. This special
scheduler will be used to run the recovery staging, review and commit.
If I have this approximately right, I concur with [~StephanErb]'s second
comment above: the 1st "Identify the leading scheduler by" option, i.e.
{{aurora_admin get_scheduler}}, will then always work, but it's beside the point
since the preparation already singled out a leader to run the recovery against.
This leads me to think the purpose of the "Identify the leading scheduler by"
section is to find the _last_ leading scheduler before recovery operations to
then go to that machine and find the latest backup file. That file is then
copied over to the recovery leading scheduler.
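Picking the latest backup could be sketched as follows, assuming backup names keep the {{scheduler-backup-YYYY-MM-DD-HH-MM}} shape of the one used above, so that a plain sort is chronological. {{latest_backup}} is a hypothetical helper name, and feeding it one name per line is an assumption; the real output format of {{scheduler_list_backups}} is not shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aurora): reads backup names on stdin and
# prints the most recent one.
# Assumption: names look like scheduler-backup-YYYY-MM-DD-HH-MM, so a
# byte-wise sort orders them chronologically.
latest_backup() {
  grep -o 'scheduler-backup-[0-9-]*' | LC_ALL=C sort | tail -n 1
}

# Example with canned names instead of a live scheduler:
printf '%s\n' scheduler-backup-2016-02-03-16-32 scheduler-backup-2016-02-02-09-10 | latest_backup
# prints scheduler-backup-2016-02-03-16-32
```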
> Update recovery docs to reflect changes
> ---------------------------------------
>
> Key: AURORA-1605
> URL: https://issues.apache.org/jira/browse/AURORA-1605
> Project: Aurora
> Issue Type: Task
> Components: Documentation
> Reporter: Joshua Cohen
> Priority: Minor
>
> We had to restore one of our clusters from backup recently, and it turns out
> there's been some drift between the [documented
> process](https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#recovering-from-a-scheduler-backup)
> and what's currently necessary.
> Specifically, we needed to disable the leader redirect filter and, I believe,
> mesos authentication.
> We should make sure the recovery docs are up to date with what's actually
> required.