> making backup working in challenging conditions was not a goal of FT
design, correct failure handling was a goal.

Every real-world production environment has challenging conditions.

That said, making progress in the face of failures is only one aspect of
FT, and an equally valid one is that failures do not cause data corruption.

If testing with chaos proves this backup solution will fail whenever there
is a failure while a backup is in progress, but that it will at least clean
up successfully and not corrupt existing state - that could be ok, for some.
Possibly, us.

If testing with chaos proves this backup solution will not suffer
corruption when there is a failure *and* can still complete successfully
despite failures while a backup is in progress - that would obviously
improve the perceived value proposition.

It would be fine to test this using the hbase-it chaos facilities, but with
a less aggressive policy than slowDeterministic - one that allows backups to
complete successfully once in a while, yet also demonstrates that when
failures do happen things are properly cleaned up and data corruption does
not occur.
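
To make that concrete, below is a minimal sketch of the kind of harness loop
such a test implies. It is written in Java with placeholder helpers -
attemptBackup(), backupStateIsClean(), and restoreAndValidate() are
hypothetical stand-ins for the actual backup command, post-failure cleanup
check, and ITBLL-style restore+verify, not real HBase APIs - and only shows
the shape of the loop: count successes under chaos, and check the
cleanup/no-corruption invariant on every failure.

// Hypothetical harness loop, not a real HBase test. attemptBackup(),
// backupStateIsClean(), and restoreAndValidate() are placeholders for the
// actual backup command, post-failure cleanup check, and ITBLL-style
// restore+verify step.
public class BackupChaosHarness {

  private int attempted = 0;
  private int succeeded = 0;

  void run(int iterations) {
    for (int i = 1; i <= iterations; i++) {
      attempted++;
      if (attemptBackup()) {            // full or incremental backup attempt
        succeeded++;
      } else if (!backupStateIsClean()) {
        // The FT claim under test: a failed backup must clean up after itself
        // and must not corrupt previously completed backups.
        throw new AssertionError("failed backup left partial or corrupt state");
      }
      if (i % 10 == 0 && !restoreAndValidate()) {
        // Periodically prove the backups are actually usable, not just "green".
        throw new AssertionError("restore of an earlier backup failed validation");
      }
    }
    System.out.printf("backups attempted=%d succeeded=%d%n", attempted, succeeded);
  }

  // Placeholders to be wired to the real tooling in an actual test.
  boolean attemptBackup() { return true; }
  boolean backupStateIsClean() { return true; }
  boolean restoreAndValidate() { return true; }
}

The numbers worth reporting are attempted vs. succeeded under the chosen
chaos policy, plus whether any of the invariant checks ever fired.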




On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <[email protected]>
wrote:

> >> Vlad: I'm obviously curious to see what you think about this stuff, in
> addition to what you already had in mind :)
>
> Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
> working in challenging conditions was not a goal of FT design, correct
> failure handling was a goal.
>
> On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <[email protected]> wrote:
>
> > Thanks for the quick feedback!
> >
> > On 9/12/17 12:36 PM, Stack wrote:
> >
> >> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <[email protected]>
> >> wrote:
> >>
> >> I think those are reasonable criteria Josh.
> >>>
> >>> What I would like to see is something like "we ran ITBLL (or custom
> >>> generator with similar correctness validation if you prefer) on a dev
> >>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
> >>> active,
> >>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and
> >>> 100% of these were successfully restored and validated." This implies your
> >>> points on automation and no manual intervention. Maybe the number of
> >>> successful backups under challenging conditions will be lower. Point is
> >>> they demonstrate we can rely on it even when a cluster is partially
> >>> unhealthy, which in production is often the normal order of affairs.
> >>>
> >>>
> >>>
> > I like it. I hadn't thought about stressing quite this aggressively, but
> > now that I think about it, sounds like a great plan. Having some ballpark
> > measure to quantify the cost of a "backup-heavy" workload would be cool
> in
> > addition to seeing how the system reacts in unexpected manners.
> >
> > Sounds good to me.
> >>
> >> How will you test the restore aspect? After 1k (or whatever makes sense)
> >> incremental backups over the life of the chaos, could you restore and
> >> validate that the table had all expected data in place.
> >>
> >
> > Exactly. My thinking was that, at any point, we should be able to do a
> > restore and validate. Maybe something like: every Nth ITBLL iteration,
> make
> > a new backup point, restore a previous backup point, verify, restore to
> > newest backup point. The previous backup point should be a full or
> > incremental point.
> >
> > Vlad: I'm obviously curious to see what you think about this stuff, in
> > addition to what you already had in mind :)
> >
>



-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
