Thanks for the quick feedback!

On 9/12/17 12:36 PM, Stack wrote:
On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <[email protected]>
wrote:

I think those are reasonable criteria Josh.

What I would like to see is something like "we ran ITBLL (or custom
generator with similar correctness validation if you prefer) on a dev
cluster (5-10 nodes) for 24 hours with server killing chaos agents active,
attempted 1,440 backups (one per minute), of which 1,000 succeeded and 100%
of these were successfully restored and validated." This implies your
points on automation and no manual intervention. Maybe the number of
successful backups under challenging conditions will be lower. Point is
they demonstrate we can rely on it even when a cluster is partially
unhealthy, which in production is often the normal order of affairs.



I like it. I hadn't thought about stressing quite this aggressively, but now that I think about it, it sounds like a great plan. Having some ballpark measure to quantify the cost of a "backup-heavy" workload would be valuable, in addition to seeing how the system reacts under unexpected conditions.
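
A rough sketch of how I'd drive the per-minute backup attempts under chaos (attemptBackup() is a hypothetical stand-in for whatever actually kicks off a backup, not a real API call; the only real point here is the success-rate bookkeeping Andrew described):

    import java.time.Duration;
    import java.time.Instant;

    public class BackupChaosDriver {

      /** Hypothetical stand-in for invoking a full or incremental backup. */
      static boolean attemptBackup() {
        // e.g. shell out to the backup tool or call its client API;
        // return true only if the backup completes successfully.
        return false;
      }

      public static void main(String[] args) throws InterruptedException {
        final Instant deadline = Instant.now().plus(Duration.ofHours(24));
        int attempted = 0;
        int succeeded = 0;

        while (Instant.now().isBefore(deadline)) {
          attempted++;
          try {
            if (attemptBackup()) {
              succeeded++;
            }
          } catch (RuntimeException e) {
            // Chaos agents are killing servers, so some failures are expected;
            // record them and keep going rather than aborting the run.
          }
          Thread.sleep(Duration.ofMinutes(1).toMillis());
        }

        System.out.printf("attempted=%d succeeded=%d (%.1f%%)%n",
            attempted, succeeded, 100.0 * succeeded / attempted);
      }
    }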

Sounds good to me.

How will you test the restore aspect? After 1k (or whatever makes sense)
incremental backups over the life of the chaos, could you restore and
validate that the table has all the expected data in place?

Exactly. My thinking was that, at any point, we should be able to do a restore and validate. Maybe something like: every Nth ITBLL iteration, make a new backup point, restore a previous backup point, verify, then restore to the newest backup point. The previous backup point could be either a full or an incremental one.
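
In case it helps make that concrete, here is a very rough sketch of that cycle (all of the helpers are hypothetical placeholders, not real APIs; in practice they'd be ITBLL's generator/verifier and whatever creates and restores the backup images):

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class RestoreValidationCycle {

      // Hypothetical stand-ins for ITBLL and the backup/restore tooling.
      static String createBackupPoint() { return "backup-" + System.nanoTime(); }
      static void restoreBackupPoint(String backupId) { }
      static void runItbllIteration() { }
      static boolean verifyLinkedList() { return true; }

      public static void main(String[] args) {
        final int n = 10;                  // validate every Nth iteration
        Deque<String> backupPoints = new ArrayDeque<>();

        for (int i = 1; i <= 100; i++) {
          runItbllIteration();

          if (i % n == 0) {
            // 1. Take a new backup point (full or incremental).
            String newest = createBackupPoint();

            // 2. Restore a previous point and verify the table contents.
            if (!backupPoints.isEmpty()) {
              restoreBackupPoint(backupPoints.peekLast());
              if (!verifyLinkedList()) {
                throw new IllegalStateException("validation failed after restore");
              }
              // 3. Roll forward to the newest point before continuing the run.
              restoreBackupPoint(newest);
            }
            backupPoints.addLast(newest);
          }
        }
      }
    }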

Vlad: I'm obviously curious to see what you think about this stuff, in addition to what you already had in mind :)
