I think we should find exact answers to these questions:
 1. What exactly is a `critical` issue?
 2. How can we detect critical issues?
 3. How can we handle critical issues?

First,
 - Ignore uninterruptible actions (e.g. worker/service shutdown)
 - Long I/O operations (there should be a configurable timeout for each
type of usage)
 - Infinite loops
 - Stalled/deadlocked threads (and/or too many parked threads, excluding
I/O); a detection sketch follows this list
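
To illustrate the last point, here is a rough sketch of how stalled or
deadlocked threads could be detected with the standard JMX thread bean.
The class name and the parked-thread heuristic are hypothetical, just to
show the idea, not anything that exists in the codebase:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: reports deadlocked and parked threads via standard JMX. */
public class ThreadStallDetector {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    /** Ids of threads deadlocked on monitors or ownable synchronizers (empty if none). */
    public long[] findDeadlockedThreads() {
        long[] ids = threads.findDeadlockedThreads();
        return ids != null ? ids : new long[0];
    }

    /** Counts threads that are parked/waiting; threads blocked on I/O stay RUNNABLE
     *  and are therefore not counted, matching the "exclude I/O" note above. */
    public long countParkedThreads() {
        long parked = 0;
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            Thread.State state = info.getThreadState();
            if (state == Thread.State.WAITING || state == Thread.State.TIMED_WAITING)
                parked++;
        }
        return parked;
    }
}
```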

Second,
 - The work queue makes no progress (e.g. disco and exchange queues)
 - Work hasn't been completed since the last heartbeat (checking
milestones)
 - A thread has used too many system resources (allocated memory, CPU)
for a long period of time
 - Timing fields associated with each thread status exceed a maximum
time limit; a heartbeat/timing sketch follows this list
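
To make the heartbeat and timing checks concrete, a minimal sketch of a
per-worker heartbeat record with a per-status time limit. All names here
(WorkerHeartbeat, maxAllowedMillis) are hypothetical and only illustrate
the idea:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-worker record: the worker updates it, a monitor only reads it. */
public class WorkerHeartbeat {
    private final AtomicLong lastHeartbeatMillis = new AtomicLong(System.currentTimeMillis());
    private volatile String status = "IDLE";

    /** Called by the worker whenever it completes a milestone or changes status. */
    public void beat(String newStatus) {
        status = newStatus;
        lastHeartbeatMillis.set(System.currentTimeMillis());
    }

    /** Called by the monitor: has the current status lasted longer than its limit? */
    public boolean exceedsLimit(long maxAllowedMillis) {
        return System.currentTimeMillis() - lastHeartbeatMillis.get() > maxAllowedMillis;
    }

    public String status() {
        return status;
    }
}
```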

Third (not too many options here),
 - `log everything` should be the default behaviour in all these cases,
since it may be difficult to find the cause after a restart.
 - Wait for some interval of time and then kill the hanging node (the
cluster should be configured to be stable enough); a handling sketch
follows this list.
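
A rough sketch of that handling order, with a hypothetical handler class
and a pluggable kill action; the only point it illustrates is "dump
diagnostics first, then stop after a grace period":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

/** Hypothetical handler: logs full thread dumps, then stops the node after a grace period. */
public class BlockedWorkerHandler {
    private final long gracePeriodMillis;
    private final Runnable killNodeAction; // e.g. a callback that stops the local node

    public BlockedWorkerHandler(long gracePeriodMillis, Runnable killNodeAction) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.killNodeAction = killNodeAction;
    }

    public void onWorkerBlocked(String workerName) throws InterruptedException {
        // 1. "Log everything" first: the cause is hard to reconstruct after a restart.
        System.err.println("Worker blocked: " + workerName);
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true))
            System.err.println(info);

        // 2. Give the worker some time to recover, then kill the hanging node.
        Thread.sleep(gracePeriodMillis);
        killNodeAction.run();
    }
}
```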

Questions,
 - Not sure, but can workers miss their heartbeat deadlines if the CPU
load goes up to 80-90%? Bursts of momentary overload can be expected
behaviour as a normal part of system operation.
 - Why have we decided that critical threads should monitor each other?
For instance, if all such tasks were blocked and unable to run, a node
reset would never occur. In my opinion, a better solution is to use a
separate monitor thread or pool (maybe with both software and hardware
checks) that not only checks heartbeats but monitors the rest of the
system as well; a watchdog sketch follows.
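
To illustrate the separate-monitor idea, a sketch of a dedicated watchdog
thread that periodically polls the hypothetical WorkerHeartbeat records
and handler from the sketches above, instead of relying on critical
threads checking each other:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog: one dedicated daemon thread checks all registered workers. */
public class WorkerWatchdog {
    private final Map<String, WorkerHeartbeat> workers = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "worker-watchdog");
            t.setDaemon(true);
            return t;
        });

    public void register(String name, WorkerHeartbeat heartbeat) {
        workers.put(name, heartbeat);
    }

    /** Starts periodic checks; workers exceeding the limit are handed to the handler. */
    public void start(long checkPeriodMillis, long maxAllowedMillis, BlockedWorkerHandler handler) {
        scheduler.scheduleAtFixedRate(() -> workers.forEach((name, hb) -> {
            if (hb.exceedsLimit(maxAllowedMillis)) {
                try {
                    // Note: the handler's grace period blocks this single watchdog
                    // thread; acceptable for a sketch, a real pool would avoid this.
                    handler.onWorkerBlocked(name);
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }), checkPeriodMillis, checkPeriodMillis, TimeUnit.MILLISECONDS);
    }
}
```

Running the checks on its own daemon thread keeps the reset path alive
even when every worker task is blocked, which is exactly the case
described above.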

On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoft...@gmail.com> wrote:

> It would be safer to restart the entire cluster than to remove the last
> node for a cache that should be redundant.
>
> On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org> wrote:
>
> > Hi,
> >
> > I agree with Yakov that we can provide some option that manages the
> > worker liveness checker's behavior when it observes that some worker
> > has been blocked for too long.
> > At least it will be a workaround for cases where node failure is too
> > annoying.
> >
> > A backups count threshold sounds good, but I don't understand how it
> > will help in case of cluster hanging.
> >
> > The simplest solution here is an alert in case some critical worker is
> > blocked (we can improve WorkersRegistry for this purpose and expose the
> > list of blocked workers) and optionally call the system-configured
> > failure processor. BTW, the failure processor can be extended in order
> > to perform any checks (e.g. backup count) and decide whether it should
> > stop the node or not.
> > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com>
> > wrote:
> > >
> > > David, Yakov, I understand your fears. But liveness checks deal with
> > > _critical_ conditions, i.e. when such a condition is met we consider
> > > the node totally broken, and there is no sense in keeping it alive
> > > regardless of the data it contains. If we want to give it a chance,
> > > then the condition (long fsync etc.) should not be considered
> > > critical at all.
> > >
> > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org> wrote:
> > >
> > > > Agree with David. We need to have an opportunity to set a backups
> > > > count threshold (at runtime also!) that will not allow any automatic
> > > > stop if it would lead to data loss. Andrey, what do you think?
> > > >
> > > > --Yakov
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >   Andrey Kuznetsov.
> >
>
-- 
Maxim Muzafarov
