Andrey, your change has finally been merged to the master branch. Congratulations and thank you very much! :)
I think the next step is a feature that will allow signaling blocked threads to monitoring tools via an MXBean. I hope you will continue developing this feature and describe your vision in a new JIRA issue.

On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
>
> David, Maxim!
>
> Thanks a lot for your ideas. Unfortunately, I can't adopt all of them right
> now: the scope is much broader than the scope of the change I am implementing.
> I talked with a group of Ignite committers, and we agreed to complete
> the change as follows.
> - Blocking instructions in system-critical workers which may reasonably last
> long should be explicitly excluded from the monitoring.
> - Failure handlers should have a setting to suppress some failures on a
> per-failure-type basis.
> I have updated the implementation accordingly: [1]
>
> [1] https://github.com/apache/ignite/pull/4089
>
> Mon, Sep 10, 2018 at 22:35, David Harvey <syssoft...@gmail.com>:
>
> > When I've done this before, I've needed to find the oldest thread and kill
> > the node running it. From a language standpoint, Maxim's "without
> > progress" is better than "heartbeat". For example, what I'm most interested
> > in on a distributed system is which thread started the work it has not
> > completed the earliest, and when that thread last made forward
> > progress. You don't want to kill a node because a thread is waiting on a
> > lock held by a thread that went off-node and has not gotten a response.
> > If you don't understand the dependency relationships, you will make
> > incorrect recovery decisions.
> >
> > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmu...@gmail.com>
> > wrote:
> >
> > > I think we should find exact answers to these questions:
> > > 1. What exactly is a `critical` issue?
> > > 2. How can we find critical issues?
> > > 3. How can we handle critical issues?
> > >
> > > First,
> > > - Ignore uninterruptible actions (e.g.
> > > worker\service shutdown)
> > > - Long I/O operations (there should be a configurable timeout for each
> > > type of usage)
> > > - Infinite loops
> > > - Stalled\deadlocked threads (and\or too many parked threads, excluding
> > > I/O)
> > >
> > > Second,
> > > - The working queue is without progress (e.g. disco, exchange queues)
> > > - Work hasn't been completed since the last heartbeat (checking
> > > milestones)
> > > - Too many system resources used by a thread for a long period of time
> > > (allocated memory, CPU)
> > > - Timing fields associated with each thread status exceeded a maximum
> > > time limit.
> > >
> > > Third (not too many options here),
> > > - `log everything` should be the default behaviour in all these cases,
> > > since it may be difficult to find the cause after the restart.
> > > - Wait some interval of time and kill the hanging node (the cluster
> > > should be configured to be stable enough)
> > >
> > > Questions,
> > > - Not sure, but can workers miss their heartbeat deadlines if CPU load
> > > goes up to 80%-90%? Bursts of momentary overload can be
> > > expected behaviour as a normal part of system operations.
> > > - Why do we decide that critical threads should monitor each other? For
> > > instance, if all the tasks were blocked and unable to run,
> > > node reset would never occur. As for me, a better solution is to use a
> > > separate monitor thread or pool (maybe both with software
> > > and hardware checks) that not only checks heartbeats but monitors the
> > > other system as well.
> > >
> > > On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoft...@gmail.com> wrote:
> > >
> > > > It would be safer to restart the entire cluster than to remove the last
> > > > node for a cache that should be redundant.
> > > >
> > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I agree with Yakov that we can provide some option that manages the
> > > > > worker liveness checker's behavior when it observes that some worker
> > > > > is blocked for too long.
> > > > > At least it will be a workaround for cases when node failure is too
> > > > > annoying.
> > > > >
> > > > > A backups count threshold sounds good, but I don't understand how it
> > > > > will help in case of the cluster hanging.
> > > > >
> > > > > The simplest solution here is to alert when some critical worker is
> > > > > blocked (we can improve WorkersRegistry for this purpose and
> > > > > expose a list of blocked workers) and optionally call the
> > > > > system-configured failure processor. BTW, the failure processor can
> > > > > be extended in order to perform any checks (e.g. backup count) and
> > > > > decide whether it should stop the node or not.
> > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > David, Yakov, I understand your fears. But liveness checks deal
> > > > > > with _critical_ conditions, i.e. when such a condition is met we
> > > > > > conclude that the node is totally broken, and there is no sense in
> > > > > > keeping it alive regardless of the data it contains. If we want to
> > > > > > give it a chance, then the condition (long fsync etc.) should not
> > > > > > be considered critical at all.
> > > > > >
> > > > > > Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org>:
> > > > > >
> > > > > > > Agree with David. We need to have an opportunity to set a backups
> > > > > > > count threshold (at runtime also!) that will not allow any
> > > > > > > automatic stop if it would cause data loss. Andrey, what do you
> > > > > > > think?
> > > > > > >
> > > > > > > --Yakov
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Andrey Kuznetsov.
> > > > > >
> > >
> > > --
> > > --
> > > Maxim Muzafarov
> > >
> >
> >
> --
> Best regards,
> Andrey Kuznetsov.
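
To make the ideas from this thread a bit more concrete, here is a rough Java sketch of how they could fit together: workers report forward progress, blocking sections that may legitimately last long are excluded from monitoring, blocked workers are exposed through an MXBean-style getter, and a failure handler can suppress failures per failure type. Everything here is an assumption made for illustration: `LivenessSketch`, `ProgressRegistry`, `BlockedWorkersMXBean`, `SuppressingHandler` and the `FailureType` values are hypothetical names, not the classes from the pull request or any Ignite API, and registration with an MBeanServer is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch; none of these names are Ignite APIs. */
final class LivenessSketch {
    /** Illustrative failure kinds that a handler may be told to ignore. */
    enum FailureType { WORKER_BLOCKED, CRITICAL_ERROR }

    /** MXBean-style view of blocked workers (JMX registration omitted). */
    public interface BlockedWorkersMXBean {
        List<String> getBlockedWorkers();
    }

    /** Remembers when each critical worker last reported forward progress. */
    static final class ProgressRegistry implements BlockedWorkersMXBean {
        private final Map<String, Long> lastProgress = new ConcurrentHashMap<>();
        private final Set<String> inExpectedBlock = ConcurrentHashMap.newKeySet();
        private final long timeoutMs;
        private final LongSupplier clock; // injected so the check is testable

        ProgressRegistry(long timeoutMs, LongSupplier clock) {
            this.timeoutMs = timeoutMs;
            this.clock = clock;
        }

        /** A worker calls this whenever it makes forward progress. */
        void onProgress(String worker) {
            lastProgress.put(worker, clock.getAsLong());
        }

        /** Start of a blocking section that may reasonably last long. */
        void enterExpectedBlock(String worker) {
            inExpectedBlock.add(worker);
        }

        /** End of the excluded section; leaving it counts as progress. */
        void leaveExpectedBlock(String worker) {
            inExpectedBlock.remove(worker);
            onProgress(worker);
        }

        /** Workers without progress for longer than the timeout, sorted by name. */
        @Override public List<String> getBlockedWorkers() {
            long now = clock.getAsLong();
            List<String> blocked = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastProgress.entrySet()) {
                boolean excluded = inExpectedBlock.contains(e.getKey());
                if (!excluded && now - e.getValue() > timeoutMs)
                    blocked.add(e.getKey());
            }
            Collections.sort(blocked);
            return blocked;
        }
    }

    /** Failure handler with per-failure-type suppression. */
    static final class SuppressingHandler {
        private final Set<FailureType> ignored = EnumSet.noneOf(FailureType.class);

        SuppressingHandler(Set<FailureType> ignoredTypes) {
            ignored.addAll(ignoredTypes);
        }

        /** Always logs; returns true only if the node should be stopped. */
        boolean onFailure(FailureType type, String details) {
            System.err.println("Failure " + type + ": " + details);
            return !ignored.contains(type);
        }
    }
}
```

In this sketch a separate monitor thread (as Maxim suggested) would poll `getBlockedWorkers()` periodically and pass any hits to the handler. A check such as the backups count threshold could live inside the handler, which is one way to read Andrey Gura's suggestion.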