Igniters,

PME-free switch [1] (available since 2.8) skips PME when a node leaves, where possible (baseline and a fully rebalanced cluster). This means the switch already waits for nothing except recovery. The optimization lets operations that have already started continue during or after the switch, as long as they are not affected by the failed primary. Upcoming operations, however, still cannot start until the switch has finished cluster-wide.

Let me propose an additional optimization: Cellular switch.

Cellular Affinity [2] means that nodes are combined into virtual cells where, for each partition, the backups are located in the same cell as the primary. The simplest way to get Cellular Affinity is to use backup filters [3].

Cellular Affinity allows the switch to finish instantly outside the affected cell, under the following assumptions:
- Replicated caches must be recovered first, since every node is affected (as a backup) by any failed primary. Replicated caches are expected to be effectively read-only (updates are extremely rare), so there should be nothing to wait for here.
- Upcoming replicated transactions (with non-failed primaries) can be started, but cannot be committed until the switch has finished cluster-wide.
- Upcoming transactions related to the broken cell will wait for cell recovery (cluster-wide switch finish).

This means that, in addition to the PME-free switch, where we are able to continue already started operations during or after the switch, we are now also able to perform most upcoming operations during the switch. In other words, Cellular switch has little effect on an operation's latency when the operation is not related to the failed cell.

According to the benchmark [4], which checks "how fast upcoming transactions (started after switch start) can be committed when we have thousands of prepared transactions (prepared before switch start)", operation latency is 5326 ms [5] on master and 65 ms [6] with the proposed fix, which is roughly 80 times faster.

The fix [7] (part of IEP-45 [8]) is ready for review. Waiting for your review!

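For readers new to the idea, here is a minimal sketch of cell-aware backup placement in plain Java, with no Ignite dependency. It is not the code from the fix [7] or the benchmark [3]: the `Node` class, the `CELL` attribute name, and the methods `assignBackups` and `mustWaitForSwitch` are all hypothetical names chosen for illustration. It only models partitioned caches, so the replicated-cache caveats listed above are out of scope.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

/** Hypothetical, Ignite-free model of cell-aware backup placement. */
final class CellularAffinitySketch {
    /** Hypothetical node attribute name that identifies the cell. */
    static final String CELL_ATTR = "CELL";

    /** Minimal stand-in for a cluster node: an id plus user attributes. */
    static final class Node {
        final String id;
        final Map<String, String> attrs;

        Node(String id, String cell) {
            this.id = id;
            this.attrs = Map.of(CELL_ATTR, cell);
        }

        String cell() {
            return attrs.get(CELL_ATTR);
        }
    }

    /** Backup filter: accept a candidate only if it is in the primary's cell. */
    static final BiPredicate<Node, Node> SAME_CELL =
        (candidate, primary) -> Objects.equals(candidate.cell(), primary.cell());

    /** Picks up to {@code backups} backup nodes for a primary, in candidate order. */
    static List<Node> assignBackups(Node primary, List<Node> candidates, int backups) {
        return candidates.stream()
            .filter(n -> n != primary)               // a primary is not its own backup
            .filter(n -> SAME_CELL.test(n, primary)) // keep backups inside the cell
            .limit(backups)
            .collect(Collectors.toList());
    }

    /**
     * Partitioned caches only: an upcoming operation has to wait for the
     * cluster-wide switch only if its primary is in the failed node's cell.
     */
    static boolean mustWaitForSwitch(Node opPrimary, Node failed) {
        return Objects.equals(opPrimary.cell(), failed.cell());
    }

    public static void main(String[] args) {
        Node a1 = new Node("a1", "A");
        Node a2 = new Node("a2", "A");
        Node b1 = new Node("b1", "B");
        Node b2 = new Node("b2", "B");
        List<Node> topology = List.of(a1, a2, b1, b2);

        // Backups for a primary in cell A are taken from cell A only.
        for (Node backup : assignBackups(a1, topology, 1))
            System.out.println("backup for a1: " + backup.id);

        // If a1 fails, operations with a primary in cell B are not held back.
        System.out.println("b1 waits: " + mustWaitForSwitch(b1, a1));
        System.out.println("a2 waits: " + mustWaitForSwitch(a2, a1));
    }
}
```

Because every partition's primary and backups share one cell, a failure in cell A cannot change partition ownership in cell B, which is why operations in B do not need to wait.
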
[1] http://apache-ignite-developers.2346864.n4.nabble.com/Non-blocking-PME-Phase-One-Node-fail-tp43531p44586.html
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up#IEP-45:CrashRecoverySpeed-Up-Cellularswitch
[3] https://gist.github.com/anton-vinogradov/c50f9d0ce3e3e2997646f84ba7eba5f5#file-bench-java-L417
[4] https://gist.github.com/anton-vinogradov/c50f9d0ce3e3e2997646f84ba7eba5f5
[5] https://gist.github.com/anton-vinogradov/a35a3a8151b7494aa84b83f58cb75889#file-master-txt-L15
[6] https://gist.github.com/anton-vinogradov/a35a3a8151b7494aa84b83f58cb75889#file-fix-txt-L15
[7] https://issues.apache.org/jira/browse/IGNITE-12617
[8] https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up