[
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483947#comment-16483947
]
Andrey Kuznetsov commented on IGNITE-6587:
------------------------------------------
Changing critical threads to {{GridWorker}}s has been moved to a separate
issue, since it has its own value.
> Ignite watchdog service
> -----------------------
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Affects Versions: 2.2
> Reporter: Alexey Goncharuk
> Assignee: Andrey Kuznetsov
> Priority: Major
> Labels: IEP-5
> Fix For: 2.6
>
> Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical
> threads. We should implement a periodic check that calls the failure handler
> when one of the following conditions is detected:
> * A critical thread is no longer alive.
> * A critical thread 'hangs' for a long time, e.g. while executing a task
> extracted from the task queue.
> When a failure condition is detected, the call stacks of all threads should
> be logged before the failure handler is invoked.
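
A minimal sketch of the "log all call stacks, then invoke the failure handler" step, using only the JDK. The class {{ThreadDumpSketch}}, its methods, and the {{Runnable}} standing in for the failure handler are hypothetical names chosen for illustration; they are not Ignite APIs.

```java
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of Ignite. */
final class ThreadDumpSketch {
    /** Formats the call stacks of all live threads into a single string. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();

        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();

            sb.append("Thread [name=").append(t.getName())
                .append(", state=").append(t.getState()).append("]\n");

            for (StackTraceElement frame : e.getValue())
                sb.append("    at ").append(frame).append('\n');
        }

        return sb.toString();
    }

    /** Logs the reason and all stacks first, then calls the (assumed) failure handler. */
    static void onCriticalFailure(String reason, Runnable failureHnd) {
        System.err.println("Critical failure detected: " + reason);
        System.err.println(dumpAllThreads());

        failureHnd.run();
    }

    public static void main(String[] args) {
        onCriticalFailure("demo", () -> System.err.println("Failure handler invoked."));
    }
}
```

Dumping before the handler runs matters because the handler may stop the node, after which the stacks are no longer available.
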
> The actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because
> this thread becomes a vulnerable point with respect to thread termination and
> CPU resource starvation. So we are to use a self-monitoring approach: critical
> threads themselves should monitor each other.
> Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility,
> which is best suited to storing and tracking system-critical threads. All of
> them should be refactored into {{GridWorker}}s and added to
> {{WorkersRegistry}}. Each worker should periodically choose some subset of
> peer workers and check whether
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp must be added to each worker in order to implement
> the latter check. Additionally, infinite queue polls, waits on monitors, and
> thread parks in system-critical threads should be refactored to their timed
> equivalents.
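
The heartbeat and timed-wait idea could look roughly like the sketch below. It assumes a simplified worker that drains a task queue; {{HeartbeatSketch}}, {{Worker}}, {{Status}} and {{check}} are hypothetical names, and the real {{GridWorker}} may differ considerably.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative only: these names are hypothetical and are not Ignite classes. */
final class HeartbeatSketch {
    enum Status { OK, DEAD, HUNG }

    /** Simplified stand-in for a worker; only monitoring-related state is shown. */
    static final class Worker implements Runnable {
        final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        volatile long heartbeatTs;  // Last time the worker reported progress.
        volatile Thread runner;     // Thread executing this worker, null until started.
        volatile boolean stopped;

        @Override public void run() {
            runner = Thread.currentThread();

            try {
                while (!stopped) {
                    heartbeatTs = System.currentTimeMillis();

                    // Timed poll instead of an indefinite take(), so the heartbeat
                    // keeps advancing while the queue is empty.
                    Runnable task = queue.poll(100, TimeUnit.MILLISECONDS);

                    if (task != null)
                        task.run();
                }
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** The check one worker would perform on a peer: liveness first, then progress. */
    static Status check(Worker peer, long nowMs, long hangThresholdMs) {
        Thread t = peer.runner;

        if (t == null || !t.isAlive())
            return Status.DEAD;

        if (nowMs - peer.heartbeatTs > hangThresholdMs)
            return Status.HUNG;

        return Status.OK;
    }

    public static void main(String[] args) throws InterruptedException {
        Worker w = new Worker();
        Thread t = new Thread(w, "demo-worker");

        t.start();
        Thread.sleep(300);
        System.out.println("While running: " + check(w, System.currentTimeMillis(), 1_000));

        w.stopped = true;
        t.join();
        System.out.println("After stop: " + check(w, System.currentTimeMillis(), 1_000));
    }
}
```

In this sketch a worker that was never started is reported the same way as a terminated one, and a long-running task still counts as a hang, since the heartbeat is only updated between tasks.
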
> Monitoring parameters (enable/disable, check interval, thread 'hang'
> threshold, etc.) are to be set via system properties.
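
For illustration, reading such parameters from system properties might look like the sketch below. The property names and default values are made up for this example; the issue does not specify them.

```java
/** Illustrative only: property names and defaults are hypothetical. */
final class MonitoringConfigSketch {
    static final String PROP_ENABLED = "example.watchdog.enabled";
    static final String PROP_CHECK_INTERVAL = "example.watchdog.checkIntervalMs";
    static final String PROP_HANG_THRESHOLD = "example.watchdog.hangThresholdMs";

    final boolean enabled;
    final long checkIntervalMs;
    final long hangThresholdMs;

    private MonitoringConfigSketch(boolean enabled, long checkIntervalMs, long hangThresholdMs) {
        this.enabled = enabled;
        this.checkIntervalMs = checkIntervalMs;
        this.hangThresholdMs = hangThresholdMs;
    }

    /** Reads the parameters from JVM system properties, falling back to assumed defaults. */
    static MonitoringConfigSketch fromSystemProperties() {
        boolean enabled = Boolean.parseBoolean(System.getProperty(PROP_ENABLED, "true"));

        // Long.getLong returns the default when the property is absent or not a number.
        long checkIntervalMs = Long.getLong(PROP_CHECK_INTERVAL, 1_000L);
        long hangThresholdMs = Long.getLong(PROP_HANG_THRESHOLD, 10_000L);

        return new MonitoringConfigSketch(enabled, checkIntervalMs, hangThresholdMs);
    }

    public static void main(String[] args) {
        MonitoringConfigSketch cfg = fromSystemProperties();

        System.out.println("enabled=" + cfg.enabled
            + ", checkIntervalMs=" + cfg.checkIntervalMs
            + ", hangThresholdMs=" + cfg.hangThresholdMs);
    }
}
```

With this shape, the values would be passed on the command line as {{-Dexample.watchdog.hangThresholdMs=30000}} and similar.
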
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling