I think those are reasonable criteria, Josh. What I would like to see is something like: "we ran ITBLL (or a custom generator with similar correctness validation, if you prefer) on a dev cluster (5-10 nodes) for 24 hours with server-killing chaos agents active, attempted 1,440 backups (one per minute), of which 1,000 succeeded and 100% of these were successfully restored and validated." This implies your points on automation and no manual intervention. Maybe the number of successful backups under challenging conditions will be lower. The point is to demonstrate we can rely on it even when a cluster is partially unhealthy, which in production is often the normal order of affairs.
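To make that concrete, below is a rough sketch of the driver loop I am picturing, with the chaos agents run separately alongside it. Caveat: the CLI shapes ("hbase backup create full <root> -t <table>", "hbase restore ..."), the table name, and the backup root are placeholders I have not checked against the patch; treat them as assumptions, not the real syntax.

import java.util.concurrent.TimeUnit;

/**
 * Sketch: one full-backup attempt per minute for 24 hours while chaos
 * agents (run separately) kill servers; count successes, then restore
 * and validate every backup that reported success.
 */
public class BackupUnderChaosDriver {

  private static final String BACKUP_ROOT = "hdfs:///backup";          // placeholder
  private static final String TABLE = "IntegrationTestBigLinkedList";  // placeholder

  public static void main(String[] args) throws Exception {
    final int attempts = 24 * 60;  // one per minute for 24 hours
    int succeeded = 0;

    for (int i = 0; i < attempts; i++) {
      long start = System.currentTimeMillis();
      // A full backup each time keeps the sketch simple; a real run should
      // mix incrementals in as well.
      if (run("hbase", "backup", "create", "full", BACKUP_ROOT, "-t", TABLE)) {
        succeeded++;
        // A real driver would record the backup id the tool prints so the
        // verify phase below can restore exactly this backup.
      }
      // Hold the one-per-minute cadence no matter how long the attempt took.
      long remaining = TimeUnit.MINUTES.toMillis(1) - (System.currentTimeMillis() - start);
      if (remaining > 0) Thread.sleep(remaining);
    }
    System.out.printf("%d/%d backup attempts succeeded under chaos%n", succeeded, attempts);

    // Verify phase (the pass/fail criterion): for every recorded backup id,
    // restore into a scratch table ("hbase restore <root> <id> ...") and run
    // the ITBLL Verify job against it. One failed restore or validation
    // fails the whole test.
  }

  /** Runs a command, inheriting stdout/stderr; true iff exit code 0. */
  private static boolean run(String... cmd) throws Exception {
    return new ProcessBuilder(cmd).inheritIO().start().waitFor() == 0;
  }
}

Wrap something like that in hbase-it so it also launches the ITBLL generator and a chaos policy itself, and your SHOULDs on automation come along for free.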
> On Sep 12, 2017, at 9:07 AM, Josh Elser <[email protected]> wrote:
>
>> On 9/11/17 11:52 PM, Stack wrote:
>> On Mon, Sep 11, 2017 at 11:07 AM, Vladimir Rodionov <[email protected]>
>> wrote:
>>> ...
>>> That is mostly it. Yes, we have not done real testing with real data on a
>>> real cluster yet, except QA testing on a small OpenStack cluster (10
>>> nodes). That is probably our biggest minus right now. I would like to
>>> inform the community that this week we are going to start full-scale
>>> testing with reasonably sized data sets.
>>>
>> ... Completion of HA seems important, as is the result of the scale testing.
>
> I think we should knock out a rough sketch of what effective "scale" testing
> would look like, since that is a very subjective phrase. Let me start the
> ball rolling with a few things that come to my mind.
>
> (interpreting requirements as per RFC 2119)
>
> * MUST have >5 RegionServers and >1 Masters in play
> * MUST have non-trivial final data sizes (final data size would be >= 100s
> of GB)
> * MUST have some clear pass/fail determination for correctness of B&R
> * MUST have some fault-injection
>
> * SHOULD be a completely automated test, not requiring a human to
> coordinate or execute commands
> * SHOULD be able to gather operational insight (metrics) while performing
> operations to determine success of the testing
> * SHOULD NOT require manual intervention, e.g. working around known
> issues/limitations
> * SHOULD reuse the IntegrationTest framework in hbase-it
>
> Since we have a concern about correctness, ITBLL sounds like a good starting
> point, to avoid having to re-write similar kinds of logic. ChaosMonkey is
> always great for fault-injection.
>
> Thoughts?
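One more note, for anyone following along who has not run ITBLL: the reason it gives the crisp pass/fail Josh asks for is that the generator writes rows that each point at the previously written row, forming closed linked lists, and the Verify job flags any dangling pointer. A lossy backup or restore cannot hide. Here is a toy in-memory sketch of that invariant; the real thing is the pair of MapReduce jobs in hbase-it, not this.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * Toy sketch of the ITBLL invariant: rows form closed linked lists, so any
 * row lost by a buggy backup/restore shows up as a broken link on verify.
 */
public class LinkedListInvariantSketch {
  public static void main(String[] args) {
    Map<Long, Long> table = new HashMap<>();  // row key -> "prev" pointer
    Random rnd = new Random(42);

    // Generate: a chain of 1M nodes, each referencing the one before it.
    long first = rnd.nextLong(), prev = first;
    for (int i = 0; i < 1_000_000; i++) {
      long key = rnd.nextLong();
      table.put(key, prev);
      prev = key;
    }
    table.put(first, prev);  // close the loop, as ITBLL does

    // Simulate data loss from a bad restore: drop one arbitrary row.
    table.remove(first);

    // Verify: every pointer must name an existing row. One missing row is
    // an unambiguous FAIL; no fuzzy judgment required.
    long broken = table.values().stream().filter(p -> !table.containsKey(p)).count();
    System.out.println(broken == 0 ? "PASS" : "FAIL: " + broken + " broken link(s)");
  }
}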
