Hi Ratis Folks!
I've been hacking on RATIS-695 - "Improve running in the face of flaky
disks"[1] and I've gotten to the point I can run Ratis with 60% of I/O
requests resulting in an `IOException` and the example server and
arithmetic client can make forward progress and record consistent data.
I have lots of questions which it seems wrong to keep to myself. I
want to ensure that this patch does not risk making the code ugly or less
maintainable. I/O in Ratis is not super tightly encapsulated and I'm
touching 16 files.
Right now, I am wrapping each I/O operation with a FailSafe object. Do
folks have an opinions on a better Java construct to not risk code
becoming bloated or ugly? There are some threading assumptions to these
FailSafe object; they default to the ForkJoin pool's common thread
pool[2].
Similarly, do folks think this is a reasonable approach or am I making
assumptions that may limit consumers' choices to incorporate Ratis? I am
trying to keep all changes mostly kept to Ratis and at worst provide easy
APIs for consumers to use.
This patch would push to Ratis to protecting itself with retrying all
(failing) I/O requests. What social or code construct may help ensure
going forward new Ratis changes to continue to use this construct as
changes to I/O happen? Perhaps, simply ensuring Jenkins runs the Namazu
I/O fuzzing test and ensures it can write a specific amount of data okay?
If you think this is a wrong direction, what approach would you prefer? An
alternative I've considered is we pursue an architecture like Zookeeper
which uses annotations on "critical threads". If a critical thread dies,
the daemon will exit. However, a limitation to this approach would be us
enforcing a requirement that consumers depending on Ratis follow this too.
Second challenge would be MITT would be much longer than a function retry.
I am curious what other's opinions are? Does anyone think this is the
completely wrong approach or a reasonable approach?
-Clay
[1]: Pull request:
https://github.com/apache/incubator-ratis/pull/45
[2]: FailSafe schedulers description:
https://jodah.net/failsafe/schedulers/
- [Discussion] Ratis dealing with flakey I/O Clay B.
-