There are at least two production cases that need to be distinguished:
The first is where a single node restart will repair the problem (and you
restart the right node).
The other cases are those where stopping the node will invalidate its
backups, leaving only one copy of the data, without resolving the
problem.  There are lots of opportunities to destroy all copies.  Automated
decisions should take into account whether the node in question is the last
source of truth.

Killing off a single bad actor using automation is safer than having humans
try it with the CEO screaming at them.
-DH


PS:  I'm just finalizing an extension which allows cache templates created
in Spring to force primary and backup copies into different failure
domains (availability zones) with no custom Java code, and I have been
fretting over all the ways to lose data.
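
For reference, the kind of configuration I mean looks roughly like the
following.  This is a sketch only: the backup-filter class name follows the
pattern of Ignite's rendezvous affinity filters but may differ in your
version, and the "AVAILABILITY_ZONE" node attribute is whatever you choose
to set on each node.

```xml
<!-- Sketch: keep primary and backup partitions in different availability
     zones, assuming every node sets a user attribute "AVAILABILITY_ZONE".
     The affinityBackupFilter class name is illustrative. -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="zoneSafeTemplate*"/>
    <property name="backups" value="1"/>
    <property name="affinity">
        <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
            <property name="affinityBackupFilter">
                <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                    <constructor-arg>
                        <array value-type="java.lang.String">
                            <value>AVAILABILITY_ZONE</value>
                        </array>
                    </constructor-arg>
                </bean>
            </property>
        </bean>
    </bean>
</bean>
```

With a filter like this, a backup is only assigned to a node whose
attribute value differs from the primary's, so losing one zone leaves at
least one copy of every partition.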

On Thu, Sep 6, 2018, 10:03 AM Andrey Kuznetsov <stku...@gmail.com> wrote:

> Igniters,
>
> Currently, we have a nearly completed implementation for system-critical
> threads liveness checking [1], in terms of IEP-14 [2] and IEP-5 [3]. In a
> nutshell, system-critical threads monitor each other and check two
> aspects:
> - whether a thread is alive;
> - whether a thread is active, i.e. it updates its heartbeat timestamp
> periodically.
> When either check fails, the critical failure handler is called, which in
> fact means node stop.
>
> The implementation of activity checks has a flaw now: some blocking actions
> are parts of normal operation and should not lead to node stop, e.g.
> - WAL writer thread can call {{fsync()}};
> - any cache write that occurs in system striped executor can lead to
> {{fsync()}} call again.
> The former example can be fixed by disabling heartbeat checks temporarily
> for known long-running actions, but it won't work for the latter one.
>
> I see a few options to address the issue:
> - Just log any long-running action instead of calling critical failure
> handler.
> - Introduce several severity levels for long-running actions handling. Each
> level will have its own failure handler. Depending on the level,
> long-running action can lead to node stop, error logging or no-op reaction.
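
The severity-level option could look roughly like the sketch below.  All
class, field, and method names here are invented for illustration; this is
not the actual IEP-14 API.  The idea is that a worker entering a known
blocking section (e.g. fsync) declares a milder reaction for a stale
heartbeat, and the watchdog acts per that declared level.

```java
// Hypothetical sketch of severity-level handling for long-running actions.
public class LivenessSketch {
    public enum Severity { IGNORE, LOG, STOP }

    /** A critical worker; updates its heartbeat and declares a timeout reaction. */
    public static class Worker {
        public volatile long heartbeatTs;
        public volatile Severity onTimeout = Severity.STOP;

        public void heartbeat() { heartbeatTs = System.currentTimeMillis(); }

        /** Mark a known blocking section (e.g. fsync) as log-only. */
        public void enterBlockingSection() { onTimeout = Severity.LOG; }

        public void exitBlockingSection() { heartbeat(); onTimeout = Severity.STOP; }
    }

    /** Watchdog decision for a worker given the current time and allowed timeout. */
    public static Severity check(Worker w, long now, long timeoutMs) {
        if (now - w.heartbeatTs <= timeoutMs)
            return Severity.IGNORE;   // heartbeat is fresh: worker is active
        return w.onTimeout;           // heartbeat is stale: react per declared severity
    }
}
```

A worker doing an ordinary cache write in the striped executor would stay
at STOP, while a WAL writer would wrap its fsync in
enterBlockingSection()/exitBlockingSection() and only be logged.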
>
> I encourage you to suggest other options. Any idea is appreciated.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-6587
> [2]
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
> [3]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74683878
>
> --
> Best regards,
>   Andrey Kuznetsov.
>
