Igniters,

Recently we had a discussion devoted to non-blocking PME. We agreed that the most important case is blocking on node failure, and it can be split into two parts:
1) The latency of operations on the affected partitions will be increased by the node failure detection duration. So, some operations may be frozen for 10+ seconds on real clusters, just waiting for a response from the failed primary. In other words, some operations will be blocked even before the blocking PME starts. The good news is that a bigger cluster decreases the percentage of blocked operations. The bad news is that these operations may block non-affected operations in:
- customer code (single-thread / striped pool usage),
- multikey operations (tx1 locked A and waits for the failed B; the non-affected tx2 waits for A) (see sketch 1 at the end of this message),
- striped pools inside AI (when some task waits for tx.op() in a sync way and the striped thread is busy),
- etc.

It seems that, thanks to StopNodeFailureHandler (if configured), we already send the node-left event before the node stops, which minimizes the waiting period. So only the cases where a node hangs without stopping are a problem now. Anyway, some additional research is required here, and it would be nice if someone is willing to help.

2) Some optimizations may speed up the node-left case (eliminate blocking of upcoming operations). A full list can be found in the presentation [1]. It contains 8 optimizations, but I propose to implement some of them in phase one and the rest in phase two. Assuming that a real production deployment has the baseline topology enabled, we are able to gain a speed-up by implementing the following:

#1 Switch on the node_fail/node_left event locally instead of starting a real PME ("local switch"). Since BLT is enabled, we are always able to switch to the new-affinity primaries (no need to preload partitions). In case we are not able to switch to the new-affinity primaries (they are all missing, or BLT is disabled), we will just start a regular PME. The new-primary calculation can be performed locally or by the coordinator (e.g. attached to the node_fail message). See sketch 2 at the end of this message.

#2 We should not wait for completion of already started operations (since they are not related to the failed primary's partitions). The only problem is recovery, which may cause update-counter duplication in case of an unsynced HWM.

#2.1 We may wait only for recovery completion ("micro-blocking switch"). Just block upcoming operations (all of them, at this phase) during the recovery by incrementing the topology version. In other words, it will still be some kind of PME with waiting, but it will wait for recovery (fast) instead of for current operations to finish (long). See sketch 3 at the end of this message.

#2.2 Recovery, theoretically, can be async. We have to solve the unsynced-HWM issue (to avoid concurrent usage of the same counters) to make that happen. We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT on the new primary and continue recovery in an async way. Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed transactions we expect between "the first backup committed tx1" and "the last backup committed the same tx1". I propose to also use it to specify the number of prepared transactions we expect between "the first backup prepared tx1" and "the last backup prepared the same tx1". Both cases look pretty similar. This way we are able to make the switch fully non-blocking, with async recovery. See sketch 4 at the end of this message.

Thoughts?

So, I'm going to implement both improvements under the "Lightweight version of partitions map exchange" issue [2] if no one minds.

[1] https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/IGNITE-9913
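
Sketch 1 (for point 1): a minimal example built on the public transactions API showing how a multikey transaction that hits a failed primary can block an unrelated transaction. The keys, values and pessimistic mode here are assumptions made only for illustration.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

class BlockedByFailedPrimaryExample {
    /** tx1 locks key A (alive primary), then touches key B whose primary just failed,
     *  so it waits for failure detection while still holding the lock on A. */
    static void tx1(Ignite ignite, IgniteCache<Integer, Integer> cache, int keyA, int keyB) {
        try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
            cache.put(keyA, 1); // lock on A acquired, A's primary is alive
            cache.put(keyB, 2); // B's primary just failed -> waits for failure detection
            tx.commit();
        }
    }

    /** tx2 only needs key A (not affected by the failure), but it is blocked
     *  behind tx1 until tx1 gets unstuck. */
    static void tx2(Ignite ignite, IgniteCache<Integer, Integer> cache, int keyA) {
        try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
            cache.put(keyA, 42); // blocked: the lock on A is still held by tx1
            tx.commit();
        }
    }
}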
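
Sketch 2 (for #1, the local switch): a rough sketch of the decision made on the node_fail/node_left event, assuming BLT is enabled. All types and method names here (baselineEnabled, calculateAffinity, switchLocally, startRegularPme, etc.) are hypothetical, not actual Ignite internals.

abstract class LocalSwitchSketch {
    interface Topology { Topology without(Object failedNode); }
    interface Affinity { boolean allNewPrimariesAvailable(); }

    abstract boolean baselineEnabled();
    abstract Topology currentTopology();
    abstract Affinity calculateAffinity(Topology top);
    abstract void startRegularPme(Object nodeLeftEvt);
    abstract void switchLocally(Affinity aff);

    /** Invoked on NODE_LEFT / NODE_FAILED instead of always starting a full PME. */
    void onNodeLeft(Object evt, Object failedNode) {
        if (!baselineEnabled()) {
            startRegularPme(evt); // BLT disabled -> regular blocking PME
            return;
        }

        // New affinity for the topology without the failed node. It can be computed
        // locally or arrive precomputed from the coordinator with the node_fail message.
        Affinity newAff = calculateAffinity(currentTopology().without(failedNode));

        if (newAff.allNewPrimariesAvailable())
            switchLocally(newAff); // no preloading needed -> local switch, no cluster-wide blocking
        else
            startRegularPme(evt);  // some new primaries are missing -> regular PME
    }
}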
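
Sketch 3 (for #2.1, the micro-blocking switch): a rough sketch of the intended semantics, where upcoming operations wait only for recovery completion while already started operations are not awaited. The class and method names are hypothetical.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

class MicroBlockingSwitchSketch {
    private final AtomicLong topVer = new AtomicLong();

    // Completed when no recovery is in progress.
    private volatile CompletableFuture<Void> recoveryFut = CompletableFuture.completedFuture(null);

    /** On primary failure: bump the topology version to park upcoming operations and
     *  start recovery, but do NOT wait for operations already mapped on the old version. */
    void onPrimaryFailed() {
        topVer.incrementAndGet();
        recoveryFut = new CompletableFuture<>();
        // ... trigger tx recovery for the failed primary's partitions here ...
    }

    /** Called when recovery for the failed primary's partitions finishes (expected to be fast). */
    void onRecoveryFinished() {
        recoveryFut.complete(null);
    }

    /** Upcoming operations wait only for recovery completion, not for in-flight operations. */
    void beforeNewOperation() {
        recoveryFut.join();
    }
}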
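
Sketch 4 (for #2.2, async recovery): a rough sketch of the counter-gap idea, where the new primary jumps its HWM by the IGNITE_MAX_COMPLETED_TX_COUNT value so that async recovery and new operations cannot assign the same update counters. AtomicLong stands in for the real partition counter structure; everything else is hypothetical too.

import java.util.concurrent.atomic.AtomicLong;

class AsyncRecoveryCounterSketch {
    private final AtomicLong hwm = new AtomicLong();

    /** Upper bound on transactions prepared between the first and the last backup;
     *  the proposal is to reuse the IGNITE_MAX_COMPLETED_TX_COUNT value for this. */
    private final long maxPreparedTxGap;

    AsyncRecoveryCounterSketch(long maxPreparedTxGap) {
        this.maxPreparedTxGap = maxPreparedTxGap;
    }

    /** On becoming the new primary: jump the HWM over the possibly unsynced range, so that
     *  counters assigned during async recovery and counters for new operations never overlap. */
    void reserveRecoveryGap() {
        hwm.addAndGet(maxPreparedTxGap);
    }

    /** New operations on the new primary get counters strictly above the reserved gap. */
    long nextCounterForNewOperation() {
        return hwm.incrementAndGet();
    }
}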