[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Kuznetsov updated IGNITE-6587:
-------------------------------------
    Description: 
As described in [1], each Ignite node has a number of system-critical threads. 
We should implement a periodic check that calls the failure handler when one of 
the following conditions is detected:
* A critical thread is no longer alive.
* A critical thread 'hangs' for a long time, e.g. while executing a task 
extracted from the task queue.
The actual list of system-critical threads can be found at [1].

Implementations based on a separate diagnostic thread seem fragile, because 
that thread becomes a vulnerable point with respect to thread termination and 
CPU resource starvation. So we are to use a self-monitoring approach: critical 
threads themselves should monitor each other.

Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility, which 
is best suited to storing and tracking system-critical threads. All of them 
should be refactored to be {{GridWorker}}s and added to {{WorkersRegistry}}. 
Each worker should periodically choose some subset of peer workers and check 
whether
* All of them are alive.
* All of them are actively running.
A 'heartbeat' timestamp must be added to the worker in order to implement the 
latter check. Additionally, infinite queue polls, waits on monitors and thread 
parks should be refactored to their timed equivalents in system-critical 
threads.
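
A minimal sketch of the self-monitoring idea, for illustration only. The 
classes below are hypothetical and are NOT the actual {{GridWorker}} or 
{{WorkersRegistry}} API; the names {{pollOnce}} and {{findFailedPeer}} are 
assumptions. For simplicity the sketch checks every peer rather than a subset, 
and it takes the current time as a parameter so the logic is easy to test.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: these classes are NOT Ignite's GridWorker or WorkersRegistry.
class HeartbeatSketch {
    /** A critical worker that records when it last made progress. */
    static class Worker {
        final String name;

        /** Thread running this worker; null until the worker is started. */
        volatile Thread runner;

        /** Time of the last heartbeat, in milliseconds. */
        volatile long heartbeatTs;

        Worker(String name, long now) {
            this.name = name;
            this.heartbeatTs = now;
        }

        /**
         * One loop iteration: refreshes the heartbeat, then does a timed poll
         * instead of an infinite take(), so the next iteration (and the next
         * heartbeat) happens even when the queue stays empty.
         */
        void pollOnce(BlockingQueue<Runnable> queue, long timeoutMs, long now)
            throws InterruptedException {
            heartbeatTs = now;

            Runnable task = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);

            if (task != null)
                task.run();
        }
    }

    /** Registry of critical workers that peers can inspect. */
    static class Registry {
        private final List<Worker> workers = new CopyOnWriteArrayList<>();

        void register(Worker w) {
            workers.add(w);
        }

        /**
         * Returns the first peer of 'self' that is either dead (its thread has
         * terminated) or hung (heartbeat older than the threshold), or null if
         * all peers look healthy. The caller would pass a non-null result to
         * the failure handler.
         */
        Worker findFailedPeer(Worker self, long now, long hangThresholdMs) {
            for (Worker w : workers) {
                if (w == self)
                    continue;

                Thread t = w.runner;

                if (t != null && !t.isAlive())
                    return w; // Started, but no longer alive.

                if (now - w.heartbeatTs > hangThresholdMs)
                    return w; // No heartbeat for too long.
            }

            return null;
        }
    }
}
```

Each worker would call {{findFailedPeer}} from its own loop at the configured 
check interval, so no dedicated diagnostic thread is needed.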

Monitoring parameters (check interval, thread 'hang' threshold, etc.) are to be 
set via system properties.
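
As an illustration, the parameters could be read as follows. This is a sketch 
only: the property names and default values below are hypothetical 
placeholders, not names defined by Ignite.

```java
// Hypothetical sketch: property names and defaults are placeholders, not Ignite's.
class MonitoringParams {
    /** Placeholder property name for the interval between peer checks. */
    static final String CHECK_INTERVAL_PROP = "example.watchdog.checkIntervalMs";

    /** Placeholder property name for the thread 'hang' threshold. */
    static final String HANG_THRESHOLD_PROP = "example.watchdog.hangThresholdMs";

    final long checkIntervalMs;
    final long hangThresholdMs;

    MonitoringParams(long checkIntervalMs, long hangThresholdMs) {
        this.checkIntervalMs = checkIntervalMs;
        this.hangThresholdMs = hangThresholdMs;
    }

    /** Reads both parameters from JVM system properties, falling back to defaults. */
    static MonitoringParams fromSystemProperties() {
        // Long.getLong returns the default when the property is unset or not a number.
        return new MonitoringParams(
            Long.getLong(CHECK_INTERVAL_PROP, 500L),
            Long.getLong(HANG_THRESHOLD_PROP, 10_000L));
    }
}
```

With this sketch, a value would be overridden on the command line in the usual 
way, e.g. {{-Dexample.watchdog.hangThresholdMs=30000}}.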

[1] 
https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

  was:
As described in [1], each Ignite node has a number of system-critical threads. 
We should implement a periodic check that calls failure handler when one of the 
following conditions has been detected:
# Critical thread is not alive anymore.
# Critical thread 'hangs' for a long time, e.g. while executing a task 
extracted from task queue. 

Actual list of system-critical threads can be found at [1].

[1] 
https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling


> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.6
>
>         Attachments: watchdog.sh
>


