[
https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103757#comment-17103757
]
Alexey Scherbakov commented on IGNITE-12617:
--------------------------------------------
[~avinogradov]
Your solution has 2 drawbacks:
1. Double latch waiting if replicated caches are present in the topology.
2. It degrades to a no-op if backups are spread across the grid nodes (the
default behavior with rendezvous affinity).
I would like to propose an algorithm which should provide the same latency
reduction as the best case of your approach, without the obvious drawbacks.
Assume a baseline node has failed.
1. As soon as all nodes have received the node-failed event, the coordinator
sends a discovery custom CollectMaxCountersMessage to collect the max update
counter for every affected partition (every partition for which the failed
node was the primary). No new transaction can reserve a counter at this point.
2. Each node owning a newly assigned primary partition waits for the
countersReadyFuture before finishing the exchange future.
3. All nodes having no such partitions finish the exchange future immediately.
4. The coordinator, on receiving the CollectMaxCountersMessage with the
collected counters, prepares a CountersReadyMessage and sends it directly to
all affected nodes from step 2.
5. Each affected node, on receiving the CountersReadyMessage, applies the max
counters to its partitions and finishes the exchange future (see the sketch
after this list).
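Below is a minimal, self-contained sketch of that counter-collection round.
All types here are illustrative stand-ins (only the two message names come
from the steps above); the real implementation would live in the exchange and
discovery machinery and is not shown.
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the proposed counter-collection round; not Ignite internals. */
public class CounterCollectionSketch {
    /** Step 1: payload of the discovery custom CollectMaxCountersMessage. */
    static class CollectMaxCountersMessage {
        /** Partitions that lost their primary on the failed node. */
        final List<Integer> affectedPartitions;

        CollectMaxCountersMessage(List<Integer> affectedPartitions) {
            this.affectedPartitions = affectedPartitions;
        }
    }

    /** Step 4: per-partition maximum update counters computed by the coordinator. */
    static class CountersReadyMessage {
        final Map<Integer, Long> maxCounterPerPartition;

        CountersReadyMessage(Map<Integer, Long> maxCounterPerPartition) {
            this.maxCounterPerPartition = maxCounterPerPartition;
        }
    }

    /** Coordinator side: merge counters reported by the owners into per-partition maximums. */
    static CountersReadyMessage collectMaxCounters(
        CollectMaxCountersMessage req,
        List<Map<Integer, Long>> countersFromOwners
    ) {
        Map<Integer, Long> max = new HashMap<>();

        for (Map<Integer, Long> ownerCounters : countersFromOwners) {
            for (int part : req.affectedPartitions) {
                Long cntr = ownerCounters.get(part);

                if (cntr != null)
                    max.merge(part, cntr, Math::max);
            }
        }

        return new CountersReadyMessage(max);
    }

    /** Step 5: a node with newly assigned primaries applies the maximums and may finish the exchange. */
    static void applyAndFinish(CountersReadyMessage msg, Map<Integer, Long> localCounters) {
        msg.maxCounterPerPartition.forEach((part, cntr) ->
            localCounters.merge(part, cntr, Math::max));

        // Here countersReadyFuture would be completed and the exchange future finished.
    }
}
{code}
The coordinator-side merge is a plain per-partition max, so the round costs
one discovery pass plus one direct message per affected node from step 2.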
This algorithm also benefits from "cells".
It can be further improved by integrating the max counters collection into
node-failure processing at the discovery layer.
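For completeness, here is a hedged sketch of how such a "cell" could be
expressed through the public affinity API: a custom affinity backup filter
that keeps every backup on a node sharing the primary's cell. The "CELL" user
attribute name, the cache name and the backup count are illustrative
assumptions, not part of this ticket.
{code:java}
import java.util.Collections;
import java.util.List;

import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.lang.IgniteBiPredicate;

public class CellAffinityConfigSketch {
    /** Accepts a backup candidate only if it is in the same cell as the already selected primary. */
    static final IgniteBiPredicate<ClusterNode, List<ClusterNode>> SAME_CELL_FILTER =
        (candidate, selected) -> {
            // "CELL" is an assumed user attribute name used only for this illustration.
            Object primaryCell = selected.get(0).attribute("CELL");

            return primaryCell != null && primaryCell.equals(candidate.attribute("CELL"));
        };

    public static IgniteConfiguration configure(String cellId) {
        RendezvousAffinityFunction aff = new RendezvousAffinityFunction();

        // Restrict backup placement to the primary's cell.
        aff.setAffinityBackupFilter(SAME_CELL_FILTER);

        CacheConfiguration<Integer, Object> cacheCfg = new CacheConfiguration<>("tx-cache");
        cacheCfg.setBackups(2);
        cacheCfg.setAffinity(aff);

        return new IgniteConfiguration()
            // Each node advertises its cell via a user attribute.
            .setUserAttributes(Collections.singletonMap("CELL", cellId))
            .setCacheConfiguration(cacheCfg);
    }
}
{code}
With such a filter the backups of any failed node are confined to its own
cell, which is the layout both approaches discussed here rely on.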
> PME-free switch should wait for recovery only at affected nodes.
> ----------------------------------------------------------------
>
> Key: IGNITE-12617
> URL: https://issues.apache.org/jira/browse/IGNITE-12617
> Project: Ignite
> Issue Type: Task
> Reporter: Anton Vinogradov
> Assignee: Anton Vinogradov
> Priority: Major
> Labels: iep-45
> Fix For: 2.9
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations have been allowed immediately
> after cluster-wide recovery finishes.
> But is there any reason to wait for a cluster-wide recovery if only one node
> failed?
> In this case, we should recover only the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread a node's
> backup partitions across the whole cluster. In this case, we obviously have
> to wait for cluster-wide recovery on switch.
> But what if only a subset of the nodes holds the backups for every primary?
> If nodes are combined into virtual cells where, for each partition, the
> backups are located in the same cell as the primary, it is possible to
> finish the switch outside the affected cell before tx recovery finishes.
> This optimization will allow us to start and even finish new operations
> outside the failed cell without waiting for the cluster-wide switch to
> finish (i.e. for the broken cell to recover).
> In other words, the switch (on node left/fail, in baseline, with data
> rebalanced) will have little effect on the latency of operations not
> related to the failed cell.
> Specifically:
> - We should wait for tx recovery before finishing the switch only on the
> broken cell.
> - We should wait for replicated caches' tx recovery everywhere, since every
> node is a backup of the failed one.
> - Upcoming operations related to the broken cell (including all replicated
> cache operations) will require the cluster-wide switch to finish before they
> are processed.