[
https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668842#comment-16668842
]
Andrey Kuznetsov commented on IGNITE-9679:
------------------------------------------
[~Artem Budnikov], thanks, great job!
Please consider some minor remarks.
* A blocked (aka hanging) worker could be included in the Critical Failures list.
* Workers of the Data Streamer striped pool could be added to the list of
mission-critical workers.
* Due to [1], configuring the blocked worker timeout has become a bit trickier.
Should this be mentioned in the docs?
[1]
https://issues.apache.org/jira/browse/IGNITE-9737?focusedCommentId=16632210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16632210
> Document critical workers liveness checking implementation
> ----------------------------------------------------------
>
> Key: IGNITE-9679
> URL: https://issues.apache.org/jira/browse/IGNITE-9679
> Project: Ignite
> Issue Type: Task
> Components: documentation
> Reporter: Andrey Kuznetsov
> Assignee: Andrey Kuznetsov
> Priority: Major
> Fix For: 2.7
>
>
> Newly implemented critical worker thread liveness checks should be mentioned
> in the Ignite documentation. A brief description of the functionality follows.
> An Ignite node has a number of critical worker threads that should be alive and
> responsive; otherwise the node's health is not guaranteed. These threads monitor
> each other periodically and track two aspects of the thread being checked:
> - whether it's alive;
> - whether it updates its internal heartbeat timestamp.
> Whenever at least one of the above conditions is violated, the checker thread
> logs the error and calls the currently configured {{FailureHandler}}.
> The {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property
> affects monitoring behavior. At runtime, monitoring settings can be changed
> via {{FailureHandlingMxBean}}.
> By default, liveness checks are enabled, but blocked system worker detection
> will not lead to failure handler invocation; see
> {{FailureProcessor#getDefaultFailureHandler}}.
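
For readers who want a concrete picture of the two checks described above (thread is alive, heartbeat timestamp is fresh), here is a minimal standalone sketch. It is not Ignite's implementation: {{LivenessSketch}}, {{MonitoredWorker}}, {{Problem}} and {{check}} are hypothetical names made up for illustration, and the timeout argument only plays the role that the blocked-worker timeout plays in the description.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch only; all names here are hypothetical, not Ignite classes. */
public class LivenessSketch {
    /** What the checker may report about a monitored worker. */
    enum Problem { TERMINATED, BLOCKED }

    /** A monitored worker: its thread plus the time of its last heartbeat. */
    static final class MonitoredWorker {
        final Thread thread;
        final AtomicLong heartbeatTs;

        MonitoredWorker(Thread thread, long nowMs) {
            this.thread = thread;
            this.heartbeatTs = new AtomicLong(nowMs);
        }

        /** The worker calls this from its main loop to show it is making progress. */
        void updateHeartbeat(long nowMs) {
            heartbeatTs.set(nowMs);
        }
    }

    /**
     * Applies both checks to one worker.
     *
     * @return the detected problem, or null if the worker looks healthy.
     */
    static Problem check(MonitoredWorker w, long nowMs, long blockedTimeoutMs) {
        // Check 1: the thread must be alive.
        if (!w.thread.isAlive())
            return Problem.TERMINATED;

        // Check 2: the heartbeat must have been updated within the timeout.
        if (nowMs - w.heartbeatTs.get() > blockedTimeoutMs)
            return Problem.BLOCKED;

        return null;
    }

    public static void main(String[] args) {
        long timeoutMs = 1_000L; // Assumed value, chosen only for the demo.

        // The current thread is certainly alive, so it stands in for a running worker.
        MonitoredWorker w = new MonitoredWorker(Thread.currentThread(), 1_000L);

        System.out.println("fresh heartbeat: " + check(w, 1_500L, timeoutMs));
        System.out.println("stale heartbeat: " + check(w, 2_500L, timeoutMs));

        // A thread that is not running (never started here, for simplicity).
        MonitoredWorker dead = new MonitoredWorker(new Thread(() -> {}), 2_500L);

        System.out.println("not running:     " + check(dead, 2_500L, timeoutMs));
    }
}
```

In this sketch a non-null result is the point at which a checker thread would log the error and hand it to the configured failure handler; timestamps are passed in explicitly so the logic can be exercised without real waiting.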