>> Vlad: I'm obviously curious to see what you think about this stuff, in addition to what you already had in mind :)
Yes, I think that we need a test tool similar to ITBLL. Btw, making backup work under challenging conditions was not a goal of the FT design; correct failure handling was.

On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <[email protected]> wrote:

> Thanks for the quick feedback!
>
> On 9/12/17 12:36 PM, Stack wrote:
>
>> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <[email protected]>
>> wrote:
>>
>>> I think those are reasonable criteria Josh.
>>>
>>> What I would like to see is something like "we ran ITBLL (or custom
>>> generator with similar correctness validation if you prefer) on a dev
>>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
>>> active, attempted 1,440 backups (one per minute), of which 1,000
>>> succeeded and 100% of these were successfully restored and validated."
>>> This implies your points on automation and no manual intervention.
>>> Maybe the number of successful backups under challenging conditions
>>> will be lower. Point is they demonstrate we can rely on it even when a
>>> cluster is partially unhealthy, which in production is often the normal
>>> order of affairs.
>>>
>
> I like it. I hadn't thought about stressing quite this aggressively, but
> now that I think about it, sounds like a great plan. Having some ballpark
> measure to quantify the cost of a "backup-heavy" workload would be cool in
> addition to seeing how the system reacts in unexpected manners.
>
>> Sounds good to me.
>>
>> How will you test the restore aspect? After 1k (or whatever makes sense)
>> incremental backups over the life of the chaos, could you restore and
>> validate that the table had all expected data in place.
>>
>
> Exactly. My thinking was that, at any point, we should be able to do a
> restore and validate. Maybe something like: every Nth ITBLL iteration, make
> a new backup point, restore a previous backup point, verify, restore to
> newest backup point. The previous backup point should be a full or
> incremental point.
>
> Vlad: I'm obviously curious to see what you think about this stuff, in
> addition to what you already had in mind :)
>
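
For what it's worth, the cycle Josh describes (every Nth ITBLL iteration: make a new backup point, restore a previous point, verify, restore to the newest point) could be sketched roughly as below. This is only an illustration against an in-memory stand-in: FakeCluster, run_backup_restore_cycle and every method name are hypothetical, not HBase or ITBLL APIs, and a real tool would drive the actual backup/restore commands with chaos agents running.

```python
import random


class FakeCluster:
    """Hypothetical in-memory stand-in for a table plus its backup points."""

    def __init__(self):
        self.table = set()
        self.backups = []  # each backup point is a frozen copy of the table

    def run_itbll_iteration(self, i):
        # Stand-in for one ITBLL generate step: add rows tagged with the iteration.
        self.table |= {(i, k) for k in range(10)}

    def create_backup(self):
        # Assumption: full vs. incremental is not modeled; every point is a full copy.
        self.backups.append(frozenset(self.table))
        return len(self.backups) - 1

    def restore(self, backup_id):
        self.table = set(self.backups[backup_id])

    def verify(self, backup_id):
        # Stand-in for ITBLL verify: the table must match what the backup captured.
        return self.table == self.backups[backup_id]


def run_backup_restore_cycle(cluster, iterations, every_n, rng):
    """Every Nth iteration: back up, restore an earlier point, verify, restore newest."""
    verified = 0
    for i in range(1, iterations + 1):
        cluster.run_itbll_iteration(i)
        if i % every_n != 0:
            continue
        newest = cluster.create_backup()
        # Pick any existing point; on the first cycle this is the newest one itself.
        previous = rng.randrange(newest + 1)
        cluster.restore(previous)
        if not cluster.verify(previous):
            raise AssertionError(f"backup point {previous} failed validation")
        cluster.restore(newest)
        if not cluster.verify(newest):
            raise AssertionError(f"backup point {newest} failed validation")
        verified += 1
    return verified


if __name__ == "__main__":
    cycles = run_backup_restore_cycle(FakeCluster(), 20, 5, random.Random(0))
    print(f"{cycles} backup/restore cycles verified")
```

Restoring back to the newest point before continuing is what keeps the data generated since the older point from being lost between cycles.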
