It would be safer to restart the entire cluster than to remove the last
node for a cache that should be redundant.
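The failure-processor idea Andrey describes below could be sketched roughly like this. To keep the sketch self-contained, the `FailureHandler`/`FailureContext` types are simplified stand-ins mirroring the shape of Ignite's failure-handling hook (where `onFailure` returning true means the node must be stopped); the backup-count threshold and the `aliveBackups` probe are hypothetical policy pieces, not existing Ignite settings:

```java
import java.util.function.IntSupplier;

// Simplified stand-ins mirroring Ignite's failure-handling hook (illustration only).
enum FailureType { SYSTEM_WORKER_BLOCKED, CRITICAL_ERROR }

final class FailureContext {
    private final FailureType type;
    FailureContext(FailureType type) { this.type = type; }
    FailureType type() { return type; }
}

interface FailureHandler {
    /** Returns true if the node must be stopped. */
    boolean onFailure(FailureContext ctx);
}

/**
 * Hypothetical handler: on a blocked critical worker, stop the node only
 * if doing so would not drop below the configured backup threshold.
 */
final class BackupAwareFailureHandler implements FailureHandler {
    private final int minAliveBackups;     // hypothetical configured threshold
    private final IntSupplier aliveBackups; // hypothetical cluster-state probe

    BackupAwareFailureHandler(int minAliveBackups, IntSupplier aliveBackups) {
        this.minAliveBackups = minAliveBackups;
        this.aliveBackups = aliveBackups;
    }

    @Override public boolean onFailure(FailureContext ctx) {
        if (ctx.type() == FailureType.SYSTEM_WORKER_BLOCKED
            && aliveBackups.getAsInt() < minAliveBackups) {
            // Stopping now would risk data loss: alert instead of stopping.
            System.err.println("Critical worker blocked, but backups below threshold; keeping node alive.");
            return false;
        }
        return true; // Otherwise treat the failure as fatal for this node.
    }
}

public class Demo {
    public static void main(String[] args) {
        // No live backups left: the handler refuses to stop the node.
        FailureHandler keep = new BackupAwareFailureHandler(1, () -> 0);
        System.out.println(keep.onFailure(new FailureContext(FailureType.SYSTEM_WORKER_BLOCKED)));

        // Enough backups: stopping the node is safe.
        FailureHandler stop = new BackupAwareFailureHandler(1, () -> 2);
        System.out.println(stop.onFailure(new FailureContext(FailureType.SYSTEM_WORKER_BLOCKED)));
    }
}
```

The point of routing the decision through the handler is that the liveness checker only reports the blocked worker; whether that is fatal stays a deployment policy.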

On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org> wrote:

> Hi,
>
> I agree with Yakov that we can provide some option that manages the
> worker liveness checker's behavior when it observes that some worker
> has been blocked for too long. At least it will provide a workaround
> for cases when node failure is too annoying.
>
> A backup count threshold sounds good, but I don't understand how it
> will help in the case of a hanging cluster.
>
> The simplest solution here is to raise an alert when some critical
> worker is blocked (we can improve WorkersRegistry for this purpose
> and expose the list of blocked workers) and optionally call a
> system-configured failure processor. BTW, the failure processor can
> be extended to perform any checks (e.g. the backup count) and decide
> whether it should stop the node or not.
> On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
> >
> > David, Yakov, I understand your fears. But liveness checks deal with
> > _critical_ conditions, i.e. when such a condition is met we conclude
> > that the node is totally broken, and there is no sense in keeping it
> > alive regardless of the data it contains. If we want to give it a
> > chance, then the condition (a long fsync, etc.) should not be
> > considered critical at all.
> >
> > Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org>:
> >
> > > Agree with David. We need to have an opportunity to set a backup
> > > count threshold (at runtime too!) that will not allow any automatic
> > > stop if it would cause data loss. Andrey, what do you think?
> > >
> > > --Yakov
> > >
> >
> >
> > --
> > Best regards,
> >   Andrey Kuznetsov.
>