It would be safer to restart the entire cluster than to remove the last node for a cache that should be redundant.
On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org> wrote:

> Hi,
>
> I agree with Yakov that we can provide some option that manages the
> worker liveness checker's behavior when it observes that some worker
> has been blocked for too long. At least it will be a workaround for
> cases when stopping the node is too annoying.
>
> A backups count threshold sounds good, but I don't understand how it
> will help in case of cluster hanging.
>
> The simplest solution here is an alert when some critical worker
> becomes blocked (we can improve WorkersRegistry for this purpose and
> expose the list of blocked workers) and, optionally, a call to the
> configured failure processor. BTW, the failure processor can be
> extended to perform any checks (e.g. backup count) and decide whether
> it should stop the node or not.
>
> On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
> >
> > David, Yakov, I understand your fears. But liveness checks deal with
> > _critical_ conditions, i.e. when such a condition is met we conclude
> > that the node is totally broken, and there is no sense in keeping it
> > alive regardless of the data it contains. If we want to give it a
> > chance, then the condition (long fsync etc.) should not be considered
> > critical at all.
> >
> > Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org>:
> >
> > > Agree with David. We need to have an opportunity to set a backups
> > > count threshold (at runtime also!) that will not allow any automatic
> > > stop if there would be a data loss. Andrey, what do you think?
> > >
> > > --Yakov
> >
> > --
> > Best regards,
> > Andrey Kuznetsov.
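To make the "backups count threshold" idea concrete, here is a minimal sketch of the decision logic being discussed: stop the node on a critical failure only if every cache would still keep at least the configured number of backup copies afterwards. The class and method names (`BackupAwareFailurePolicy`, `mayStopNode`) are hypothetical illustrations, not the actual Ignite FailureHandler API; wiring this into a real failure processor is left out.

```java
import java.util.Map;

/**
 * Hypothetical sketch (not Ignite API): decide whether an automatic node
 * stop is safe given a configured backups-count threshold.
 */
public class BackupAwareFailurePolicy {
    private final int backupsThreshold;

    public BackupAwareFailurePolicy(int backupsThreshold) {
        this.backupsThreshold = backupsThreshold;
    }

    /**
     * Returns true if this node may be stopped: after it leaves, every
     * cache must still have at least {@code backupsThreshold} live
     * backup copies, i.e. stopping must not risk data loss.
     */
    public boolean mayStopNode(Map<String, Integer> liveBackupsPerCache) {
        for (int backups : liveBackupsPerCache.values()) {
            if (backups - 1 < backupsThreshold)
                return false; // stopping would drop below the threshold
        }
        return true;
    }

    public static void main(String[] args) {
        BackupAwareFailurePolicy policy = new BackupAwareFailurePolicy(1);
        // Both caches keep >= 1 backup after this node leaves: safe to stop.
        System.out.println(policy.mayStopNode(Map.of("cacheA", 2, "cacheB", 3)));
        // Only one backup left; stopping would lose redundancy: keep the node.
        System.out.println(policy.mayStopNode(Map.of("cacheA", 1)));
    }
}
```

An alerting path (as Andrey suggests) would run regardless of the decision; only the automatic stop is gated by this check.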