Re: Recovery instructions updates

2018-06-03 Thread Meghdoot bhattacharya
We will try to recover the log files from the snapshot loading error.

+1 to Bill’s approach of making recovery offline. We will try the patch on our 
side.

Renan, I would ask you to prepare a PR for the restoration docs proposing the 2 
additional steps required in the current world, while we look into possibly using 
a different mechanism. The prep steps to get the scheduler ready for backup can 
hopefully be eliminated with the alternative approach.
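
For reference, a minimal sketch of how those 2 extra steps (clearing the 
scheduler's ZK entries and setting aside the snapshot backup directory) could be 
scripted. The ensemble string, ZK path, and backup directory below are made-up 
placeholders, and this is an untested illustration using Curator, not code from 
the Aurora tree:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ManualRestorePrep {
      public static void main(String[] args) throws Exception {
        // Placeholder values -- substitute your cluster's ensemble, serverset
        // path, and the directory the scheduler backups live in.
        String zkConnect = "zk1:2181,zk2:2181,zk3:2181";
        String schedulerZkPath = "/aurora/scheduler";
        Path backupDir = Paths.get("/var/lib/aurora/backups");

        // Step 1: remove the scheduler's ZK entries so no stale leader or
        // serverset state is left behind.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
          if (client.checkExists().forPath(schedulerZkPath) != null) {
            client.delete().deletingChildrenIfNeeded().forPath(schedulerZkPath);
          }
        } finally {
          client.close();
        }

        // Step 2: move the snapshot backup directory aside so the scheduler
        // cannot rehydrate from it.
        if (Files.exists(backupDir)) {
          Files.move(backupDir, backupDir.resolveSibling("backups.aside"));
        }
      }
    }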

On our side, let's see if we can recover the logs from the corrupted snapshot 
loading.
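
And on the defensive handling mentioned in my earlier mail below (skip a nil 
entry while loading instead of failing the scheduler), a rough sketch of the 
idea; the types here are stand-ins, not the real generated thrift structs:

    import java.util.List;
    import java.util.stream.Collectors;

    public class SnapshotNilFilter {

      // Stand-in for whatever per-task struct the snapshot actually holds.
      static class TaskEntry {
        String taskId;
        Object assignedTask; // nil in the corrupted entries we hit
      }

      // Drop entries whose task payload is missing and log them, rather than
      // letting the scheduler fail while catching up on state after failover.
      static List<TaskEntry> dropNilEntries(List<TaskEntry> entries) {
        return entries.stream()
            .filter(e -> {
              boolean valid = e != null && e.assignedTask != null;
              if (!valid) {
                System.err.println("Skipping snapshot entry with nil task: "
                    + (e == null ? "<null entry>" : e.taskId));
              }
              return valid;
            })
            .collect(Collectors.toList());
      }
    }

Something along those lines, plus a counter for how many entries got skipped, 
would at least keep the scheduler up while we root-cause the snapshot writer.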


Thx

> On Jun 3, 2018, at 9:50 AM, Stephan Erb  wrote:
> 
> That sounds indeed concerning. Would be great if you could file an issue and 
> attach the related log files and tracebacks.
> 
> Bill recently added a potential replacement for the existing restore 
> mechanism: 
> https://github.com/apache/aurora/commit/2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714.
>  Given the set of issues you have bumped into with the current restore, this 
> new approach might be worth exploring further.
> 
> On 03.06.18, 08:43, "Meghdoot bhattacharya"  
> wrote:
> 
>Thx Renan for sharing the details. This backup restore happened under quite 
> difficult circumstances, so I would encourage the leads to keep the docs 
> updated as much as possible and to include them in release validation.
> 
>The other issue, snapshots having task and other objects as nil, which causes 
> the schedulers to fail, we have now seen 2 times in the past year. Besides 
> finding the root cause of why that entry appears during snapshot creation, 
> there needs to be either defensive code to ignore such an entry on loading or 
> a way to fix the snapshot. Otherwise we might have to go through a day's worth 
> of snapshots to find which one did not have that entry and recover from there. 
> Mean time to recover gets impacted under those circumstances. One extra piece 
> of info, not sure whether it is relevant: the corrupted snapshot was created 
> by the admin CLI (our assumption is it should not matter whether the scheduler 
> triggers it or it is forced via the CLI), and it showed success, as did the 
> Aurora logs, but then loading it exposed the issue.
> 
>Thx
> 
>> On Jun 2, 2018, at 3:54 PM, Renan DelValle  wrote:
>> 
>> Hi all,
>> 
>> We tried following the recovery instructions from
>> http://aurora.apache.org/documentation/latest/operations/backup-restore/
>> 
>> After our change from the Twitter commons ZK library to Apache Curator,
>> these instructions are no longer valid.
>> 
>> In order for Aurora to carry out a leader election in the current state,
>> Aurora has to first connect to a Mesos master. What we ended up doing was
>> connecting to a Mesos master that had nothing on it to bypass this new
>> requirement.
>> 
>> Next, wiping away -native_log_file_path did not seem to be enough to
>> recover from a corrupted Mesos replicated log. We had to manually wipe away
>> entries in ZK and move the snapshot backup directory so that the leader
>> would not fall back on either a snapshot or the mesos-log to rehydrate
>> itself.
>> 
>> Finally, somehow triggering a manual snapshot generated a snapshot with an
>> invalid entry which then caused the scheduler to fail after a failover
>> while trying to catch up on current state.
>> 
>> We are trying to investigate why this took place (it could be that we
>> didn't give the system enough time to finish hydrating the snapshot), but
>> the invalid entry, which looked something like a Task with all null or 0
>> values, caused our leaders to fail (which necessitated restoring from an
>> earlier snapshot). Note that this happened only after we triggered the
>> manual snapshot and BEFORE we tried to restore.
>> 
>> Will report more details as they become available and will provide some doc
>> updates based on our experience.
>> 
>> -Renan



Re: Recovery instructions updates

2018-06-03 Thread Stephan Erb
That sounds indeed concerning. Would be great if you could file an issue and 
attach the related log files and tracebacks.

Bill recently added a potential replacement for the existing restore mechanism: 
https://github.com/apache/aurora/commit/2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714.
 Given the set of issues you have bumped into with the current restore, this 
new approach might be worth exploring further.

On 03.06.18, 08:43, "Meghdoot bhattacharya"  
wrote:

Thx Renan for sharing the details. This backup restore happened under quite 
difficult circumstances, so I would encourage the leads to keep the docs updated 
as much as possible and to include them in release validation.

The other issue, snapshots having task and other objects as nil, which causes 
the schedulers to fail, we have now seen 2 times in the past year. Besides 
finding the root cause of why that entry appears during snapshot creation, there 
needs to be either defensive code to ignore such an entry on loading or a way to 
fix the snapshot. Otherwise we might have to go through a day's worth of 
snapshots to find which one did not have that entry and recover from there. Mean 
time to recover gets impacted under those circumstances. One extra piece of 
info, not sure whether it is relevant: the corrupted snapshot was created by the 
admin CLI (our assumption is it should not matter whether the scheduler triggers 
it or it is forced via the CLI), and it showed success, as did the Aurora logs, 
but then loading it exposed the issue.

Thx

> On Jun 2, 2018, at 3:54 PM, Renan DelValle  wrote:
> 
> Hi all,
> 
> We tried following the recovery instructions from
> http://aurora.apache.org/documentation/latest/operations/backup-restore/
> 
> After our change from the Twitter commons ZK library to Apache Curator,
> these instructions are no longer valid.
> 
> In order for Aurora to carry out a leader election in the current state,
> Aurora has to first connect to a Mesos master. What we ended up doing was
> connecting to a Mesos master that had nothing on it to bypass this new
> requirement.
> 
> Next, wiping away -native_log_file_path did not seem to be enough to
> recover from a corrupted Mesos replicated log. We had to manually wipe away
> entries in ZK and move the snapshot backup directory so that the leader
> would not fall back on either a snapshot or the mesos-log to rehydrate
> itself.
> 
> Finally, somehow triggering a manual snapshot generated a snapshot with an
> invalid entry which then caused the scheduler to fail after a failover
> while trying to catch up on current state.
> 
> We are trying to investigate why this took place (it could be that we
> didn't give the system enough time to finish hydrating the snapshot), but
> the invalid entry, which looked something like a Task with all null or 0
> values, caused our leaders to fail (which necessitated restoring from an
> earlier snapshot). Note that this happened only after we triggered the
> manual snapshot and BEFORE we tried to restore.
> 
> Will report more details as they become available and will provide some doc
> updates based on our experience.
> 
> -Renan