On 11/8/17 1:26 PM, Andrew Purtell wrote:
> I won't speak to the timing aspects of this, that's up to the RM, but the
> testing details look reasonable to me.
Understood and agree. Thanks for your input!
> With respect to chaos testing, the
> following goals would be good:
> - Some backups and restores succeed even with masters and RSes going up and
> down. The resiliency can always be improved later, but we can't rely on no
> failures for the entire duration of a backup or restore operation to get a
> good result, especially for restore.
Yup! The expectation (if not explicitly stated) is that we would work
our way up to the ServerKilling monkey. That should be trivial to
implement, since IntegrationTestBase would wire it up for us.
> - Backups are not corrupted by failures. Or, corrupted (partial?) backups
> are identified and ignored and there are still good backups remaining which
> can be used for restore.
> - When the verification tool says a backup and restore are good, they
> really are.
/me nods. Agreed.
I think we'll learn a bit about failure situations (the doc intentionally
avoided defining problems and solutions), and the problems we see will
help shape the solutions we need to build.