Re: [Discussion] Ratis dealing with flakey I/O

Tsz Wo Sze Thu, 14 Nov 2019 11:31:05 -0800

Hi Clay,

Thanks a lot for working on this.  The approach in general looks good
to me.  As you mentioned, wrapping with FailSafe may make the code
ugly.  Also, it may be undesirable for Ratis becoming tightly coupled
with FailSafe   How about we do the following:


(1) Refactor the code so that all the system calls are moved to one
place (e,g. all the file related calls should use FileUtils.  We may
do something similar for the other system calls)
(2) Add retry support.  I prefer using our own RetryPolicy instead of
FailSafe.  I am open to make it pluggable/configurable.

How does that sound?

Tsz-Wo

On Tue, Nov 12, 2019 at 4:26 PM Clay B. <[email protected]> wrote:
>
> Hi Ratis Folks!
>
> I've been hacking on RATIS-695 - "Improve running in the face of flaky
> disks"[1] and I've gotten to the point I can run Ratis with 60% of I/O
> requests resulting in an `IOException` and the example server and
> arithmetic client can make forward progress and record consistent data.
>
> I have lots of questions which it seems wrong to keep to myself. I
> want to ensure that this patch does not risk making the code ugly or less
> maintainable. I/O in Ratis is not super tightly encapsulated and I'm
> touching 16 files.
>
> Right now, I am wrapping each I/O operation with a FailSafe object. Do
> folks have an opinions on a better Java construct to not risk code
> becoming bloated or ugly? There are some threading assumptions to these
> FailSafe object; they default to the ForkJoin pool's common thread
> pool[2].
>
> Similarly, do folks think this is a reasonable approach or am I making
> assumptions that may limit consumers' choices to incorporate Ratis? I am
> trying to keep all changes mostly kept to Ratis and at worst provide easy
> APIs for consumers to use.
>
> This patch would push to Ratis to protecting itself with retrying all
> (failing) I/O requests. What social or code construct may help ensure
> going forward new Ratis changes to continue to use this construct as
> changes to I/O happen? Perhaps, simply ensuring Jenkins runs the Namazu
> I/O fuzzing test and ensures it can write a specific amount of data okay?
>
> If you think this is a wrong direction, what approach would you prefer? An
> alternative I've considered is we pursue an architecture like Zookeeper
> which uses annotations on "critical threads". If a critical thread dies,
> the daemon will exit. However, a limitation to this approach would be us
> enforcing a requirement that consumers depending on Ratis follow this too.
> Second challenge would be MITT would be much longer than a function retry.
>
> I am curious what other's opinions are? Does anyone think this is the
> completely wrong approach or a reasonable approach?
>
> -Clay
>
> [1]: Pull request:
> https://github.com/apache/incubator-ratis/pull/45
> [2]: FailSafe schedulers description:
> https://jodah.net/failsafe/schedulers/

Re: [Discussion] Ratis dealing with flakey I/O

Reply via email to