[ https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anton Vinogradov updated IGNITE-12617:
--------------------------------------
    Comment: was deleted

(was: {panel:title=Branch: [pull/7490/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5270941&buildTypeId=IgniteTests24Java8_RunAll])

> PME-free switch should wait for recovery only at affected nodes.
> ----------------------------------------------------------------
>
>                 Key: IGNITE-12617
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12617
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations are allowed immediately after
> cluster-wide recovery has finished.
> But is there any reason to wait for cluster-wide recovery if only one node
> failed?
> In this case, we should recover only the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread a node's
> backup partitions across the whole cluster. In that case we obviously have
> to wait for cluster-wide recovery on switch.
> But what if only some nodes are backups for every primary?
> If nodes are combined into virtual cells where, for each partition, the
> backups are located in the same cell as the primary, it is possible to
> finish the switch outside the affected cell before tx recovery finishes.
> This optimization will allow us to start and even finish new operations
> outside the failed cell without waiting for the cluster-wide switch to
> finish (that is, for the broken cell's recovery).
> In other words, a switch (on node left/fail + baseline + rebalanced) will
> have little effect on the latency of operations not related to the failed
> cell.
> To summarize:
> - We should wait for tx recovery before finishing the switch only on the
> broken cell.
> - We should wait for replicated caches' tx recovery everywhere, since every
> node is a backup of the failed one.
> - Upcoming operations related to the broken cell (including all operations
> on replicated caches) will require the cluster-wide switch to finish before
> they are processed.
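
The three rules above can be sketched as a small standalone model. This is only an illustration, not Ignite code: the class `CellSwitchSketch`, the method `nodesWaitingForRecovery`, the string node names and the integer cell ids are all hypothetical, and the presence of replicated caches is collapsed into a single boolean flag.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model (not Ignite code): which surviving nodes must wait for tx recovery. */
final class CellSwitchSketch {
    /**
     * @param cellByNode cell id for every baseline node, including the failed one (assumed layout).
     * @param failedNode the node that left or failed.
     * @param hasReplicatedCaches whether at least one replicated cache exists.
     * @return surviving nodes that must wait for tx recovery before finishing the switch.
     */
    static Set<String> nodesWaitingForRecovery(Map<String, Integer> cellByNode,
                                               String failedNode,
                                               boolean hasReplicatedCaches) {
        Integer failedCell = cellByNode.get(failedNode);

        if (failedCell == null)
            throw new IllegalArgumentException("Unknown node: " + failedNode);

        Set<String> waiting = new TreeSet<>();

        for (Map.Entry<String, Integer> e : cellByNode.entrySet()) {
            // The failed node is no longer part of the topology.
            if (e.getKey().equals(failedNode))
                continue;

            // With replicated caches every node is a backup of the failed one;
            // otherwise only nodes sharing the failed node's cell are affected.
            if (hasReplicatedCaches || e.getValue().equals(failedCell))
                waiting.add(e.getKey());
        }

        return waiting;
    }
}
```

With nodes A and B in cell 1 and nodes C and D in cell 2, a failure of A makes only B wait when all caches are partitioned, while B, C and D all wait as soon as a replicated cache is present.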