Igniters,

Recently we had a discussion devoted to non-blocking PME. We agreed that the most important case is blocking on node failure, and it can be split into two parts:
1) The latency of operations on the affected partitions will be increased by the node failure detection duration. So, some operations may be frozen for 10+ seconds on real clusters, just waiting for a response from the failed primary. In other words, some operations will be blocked even before the blocking PME starts. The good news is that a bigger cluster decreases the percentage of blocked operations. The bad news is that these operations may block non-affected operations in:
- customer code (single-thread / striped pool usage),
- multikey operations (tx1 locked A and waits for the failed B; the non-affected tx2 waits for A) (see sketch 1 at the end of this message),
- striped pools inside AI (when some task waits for tx.op() in a sync way and the striped thread is busy),
- etc.

It seems that, thanks to StopNodeFailureHandler (if configured), we already send the node-left event before the node stops, which minimizes the waiting period. So only the cases where a node hangs without stopping are a problem now. Anyway, some additional research is required here, and it would be nice if someone is willing to help.

2) Some optimizations may speed up the node-left case (eliminate blocking of upcoming operations). A full list can be found in the presentation [1]. It contains 8 optimizations, but I propose to implement some of them in phase one and the rest in phase two. Assuming that a real production deployment has the baseline topology enabled, we are able to gain a speed-up by implementing the following:

#1 Switch on the node_fail/node_left event locally instead of starting a real PME ("local switch"). Since BLT is enabled, we are always able to switch to the new-affinity primaries (no need to preload partitions). In case we are not able to switch to the new-affinity primaries (they are all missing, or BLT is disabled), we will just start a regular PME. The new-primary calculation can be performed locally or by the coordinator (e.g. attached to the node_fail message). See sketch 2 at the end of this message.

#2 We should not wait for completion of already started operations (since they are not related to the failed primary's partitions). The only problem is recovery, which may cause update-counter duplication in case of an unsynced HWM.

#2.1 We may wait only for recovery completion ("micro-blocking switch"). Just block upcoming operations (all of them, at this phase) during the recovery by incrementing the topology version. In other words, it will still be some kind of PME with waiting, but it will wait for recovery (fast) instead of for current operations to finish (long). See sketch 3 at the end of this message.

#2.2 Recovery, theoretically, can be async. We have to solve the unsynced-HWM issue (to avoid concurrent usage of the same counters) to make that happen. We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT on the new primary and continue recovery in an async way. Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed transactions we expect between "the first backup committed tx1" and "the last backup committed the same tx1". I propose to also use it to specify the number of prepared transactions we expect between "the first backup prepared tx1" and "the last backup prepared the same tx1". Both cases look pretty similar. This way we are able to make the switch fully non-blocking, with async recovery. See sketch 4 at the end of this message.

Thoughts?

So, I'm going to implement both improvements under the "Lightweight version of partitions map exchange" issue [2] if no one minds.

[1] https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/IGNITE-9913
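
Sketch 1 (for point 1): a minimal example built on the public transactions API showing how a multikey transaction that hits a failed primary can block an unrelated transaction. The keys, values and pessimistic mode here are assumptions made only for illustration.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

class BlockedByFailedPrimaryExample {
    /** tx1 locks key A (alive primary), then touches key B whose primary just failed,
     *  so it waits for failure detection while still holding the lock on A. */
    static void tx1(Ignite ignite, IgniteCache<Integer, Integer> cache, int keyA, int keyB) {
        try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
            cache.put(keyA, 1); // lock on A acquired, A's primary is alive
            cache.put(keyB, 2); // B's primary just failed -> waits for failure detection
            tx.commit();
        }
    }

    /** tx2 only needs key A (not affected by the failure), but it is blocked
     *  behind tx1 until tx1 gets unstuck. */
    static void tx2(Ignite ignite, IgniteCache<Integer, Integer> cache, int keyA) {
        try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
            cache.put(keyA, 42); // blocked: the lock on A is still held by tx1
            tx.commit();
        }
    }
}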
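
Sketch 2 (for #1, the local switch): a rough sketch of the decision made on the node_fail/node_left event, assuming BLT is enabled. All types and method names here (baselineEnabled, calculateAffinity, switchLocally, startRegularPme, etc.) are hypothetical, not actual Ignite internals.

abstract class LocalSwitchSketch {
    interface Topology { Topology without(Object failedNode); }
    interface Affinity { boolean allNewPrimariesAvailable(); }

    abstract boolean baselineEnabled();
    abstract Topology currentTopology();
    abstract Affinity calculateAffinity(Topology top);
    abstract void startRegularPme(Object nodeLeftEvt);
    abstract void switchLocally(Affinity aff);

    /** Invoked on NODE_LEFT / NODE_FAILED instead of always starting a full PME. */
    void onNodeLeft(Object evt, Object failedNode) {
        if (!baselineEnabled()) {
            startRegularPme(evt); // BLT disabled -> regular blocking PME
            return;
        }

        // New affinity for the topology without the failed node. It can be computed
        // locally or arrive precomputed from the coordinator with the node_fail message.
        Affinity newAff = calculateAffinity(currentTopology().without(failedNode));

        if (newAff.allNewPrimariesAvailable())
            switchLocally(newAff); // no preloading needed -> local switch, no cluster-wide blocking
        else
            startRegularPme(evt);  // some new primaries are missing -> regular PME
    }
}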
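
Sketch 3 (for #2.1, the micro-blocking switch): a rough sketch of the intended semantics, where upcoming operations wait only for recovery completion while already started operations are not awaited. The class and method names are hypothetical.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

class MicroBlockingSwitchSketch {
    private final AtomicLong topVer = new AtomicLong();

    // Completed when no recovery is in progress.
    private volatile CompletableFuture<Void> recoveryFut = CompletableFuture.completedFuture(null);

    /** On primary failure: bump the topology version to park upcoming operations and
     *  start recovery, but do NOT wait for operations already mapped on the old version. */
    void onPrimaryFailed() {
        topVer.incrementAndGet();
        recoveryFut = new CompletableFuture<>();
        // ... trigger tx recovery for the failed primary's partitions here ...
    }

    /** Called when recovery for the failed primary's partitions finishes (expected to be fast). */
    void onRecoveryFinished() {
        recoveryFut.complete(null);
    }

    /** Upcoming operations wait only for recovery completion, not for in-flight operations. */
    void beforeNewOperation() {
        recoveryFut.join();
    }
}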
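
Sketch 4 (for #2.2, async recovery): a rough sketch of the counter-gap idea, where the new primary jumps its HWM by the IGNITE_MAX_COMPLETED_TX_COUNT value so that async recovery and new operations cannot assign the same update counters. AtomicLong stands in for the real partition counter structure; everything else is hypothetical too.

import java.util.concurrent.atomic.AtomicLong;

class AsyncRecoveryCounterSketch {
    private final AtomicLong hwm = new AtomicLong();

    /** Upper bound on transactions prepared between the first and the last backup;
     *  the proposal is to reuse the IGNITE_MAX_COMPLETED_TX_COUNT value for this. */
    private final long maxPreparedTxGap;

    AsyncRecoveryCounterSketch(long maxPreparedTxGap) {
        this.maxPreparedTxGap = maxPreparedTxGap;
    }

    /** On becoming the new primary: jump the HWM over the possibly unsynced range, so that
     *  counters assigned during async recovery and counters for new operations never overlap. */
    void reserveRecoveryGap() {
        hwm.addAndGet(maxPreparedTxGap);
    }

    /** New operations on the new primary get counters strictly above the reserved gap. */
    long nextCounterForNewOperation() {
        return hwm.incrementAndGet();
    }
}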