Critical worker threads liveness checking drawbacks

Andrey Kuznetsov Thu, 06 Sep 2018 07:03:31 -0700

Igniters,

Currently, we have a nearly complete implementation of liveness checking
for system-critical threads [1], in terms of IEP-14 [2] and IEP-5 [3]. In a
nutshell, system-critical threads monitor each other and check two
aspects:
- whether a thread is alive;
- whether a thread is active, i.e. it updates its heartbeat timestamp
periodically.
When either check fails, the critical failure handler is called, which in
practice means the node is stopped.
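
To make the two checks concrete, here is a minimal sketch of how they could
be expressed. The class and method names (HeartbeatRegistry, findFailed,
etc.) are hypothetical and are not taken from the actual implementation in
[1]; timestamps are passed in explicitly only to keep the example simple.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of monitored threads; not Ignite's actual API. */
class HeartbeatRegistry {
    /** Last heartbeat timestamp (ms) per monitored thread. */
    private final Map<Thread, Long> heartbeats = new ConcurrentHashMap<>();

    /** Maximum allowed gap between heartbeats (an assumed setting). */
    private final long timeoutMs;

    HeartbeatRegistry(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Starts monitoring a thread, or refreshes its heartbeat. */
    void updateHeartbeat(Thread t, long nowMs) {
        heartbeats.put(t, nowMs);
    }

    /** Returns the first thread that fails either check, or null if all pass. */
    Thread findFailed(long nowMs) {
        for (Map.Entry<Thread, Long> e : heartbeats.entrySet()) {
            Thread t = e.getKey();
            boolean dead = !t.isAlive();                      // check 1: alive
            boolean stale = nowMs - e.getValue() > timeoutMs; // check 2: active
            if (dead || stale)
                return t;
        }
        return null;
    }
}
```

A checker would call findFailed() periodically and pass any non-null result
to the failure handler.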


The implementation of activity checks currently has a flaw: some blocking
actions are part of normal operation and should not lead to a node stop, e.g.
- the WAL writer thread can call {{fsync()}};
- any cache write that occurs in the system striped executor can also lead
to an {{fsync()}} call.
The former example can be fixed by temporarily disabling heartbeat checks
for known long-running actions, but this won't work for the latter one.
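
As an illustration of the fix for the former case, the sketch below suspends
the activity check around a known long-running action. MonitoredWorker and
runBlocking are hypothetical names used only for this example, and the
injected clock is an assumption made to keep it testable.

```java
import java.util.function.LongSupplier;

/** Hypothetical worker whose activity check can be suspended temporarily. */
class MonitoredWorker {
    private final LongSupplier clock;
    private volatile long heartbeatMs;
    private volatile boolean checksSuspended;

    MonitoredWorker(LongSupplier clock) {
        this.clock = clock;
        this.heartbeatMs = clock.getAsLong();
    }

    void updateHeartbeat() {
        heartbeatMs = clock.getAsLong();
    }

    /** Runs a known long-running action (e.g. an fsync) with checks disabled. */
    void runBlocking(Runnable action) {
        checksSuspended = true;
        try {
            action.run();
        }
        finally {
            // Refresh the heartbeat first so the check resumes from "now".
            updateHeartbeat();
            checksSuspended = false;
        }
    }

    /** Activity check as seen by the monitoring thread. */
    boolean isStalled(long timeoutMs) {
        return !checksSuspended && clock.getAsLong() - heartbeatMs > timeoutMs;
    }
}
```

This approach requires every potentially blocking call site to be known in
advance and wrapped explicitly.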

I see a few options to address the issue:
- Just log any long-running action instead of calling the critical failure
handler.
- Introduce several severity levels for handling long-running actions. Each
level will have its own failure handler. Depending on the level, a
long-running action can lead to a node stop, error logging, or no reaction.
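
The second option could look roughly like the sketch below. The Severity
values and the LongActionDispatcher class are hypothetical; they only
illustrate the idea of one handler per level and do not describe an existing
API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical severity levels for long-running actions. */
enum Severity { IGNORE, LOG, STOP_NODE }

/** Hypothetical dispatcher that routes each level to its own handler. */
class LongActionDispatcher {
    private final Map<Severity, Consumer<String>> handlers =
        new EnumMap<>(Severity.class);

    void setHandler(Severity severity, Consumer<String> handler) {
        handlers.put(severity, handler);
    }

    /** Invokes the handler for the given level; no handler means no-op. */
    void onLongRunningAction(String threadName, Severity severity) {
        Consumer<String> handler = handlers.get(severity);
        if (handler != null)
            handler.accept(threadName);
    }
}
```

With this shape, a WAL writer blocked in {{fsync()}} could be reported at the
LOG level, while STOP_NODE would remain reserved for cases that warrant the
critical failure handler.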

I encourage you to suggest other options. Any idea is appreciated.

[1] https://issues.apache.org/jira/browse/IGNITE-6587
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
[3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74683878

--
Best regards,
  Andrey Kuznetsov.
