[ 
https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103757#comment-17103757
 ] 

Alexey Scherbakov edited comment on IGNITE-12617 at 5/11/20, 8:25 AM:
----------------------------------------------------------------------

[~avinogradov]

Your solution has two drawbacks:

1. Double latch waiting if replicated caches are present in the topology.
2. It degrades to a no-op if backups are spread across grid nodes (which is the 
default behavior with rendezvous affinity).

I would like to propose an algorithm which should provide the same latency 
reduction as the best case of your approach, without the obvious drawbacks.

Assume a baseline node has failed.

1. As soon as all nodes have received the node-failed event, the coordinator 
sends a custom discovery message, CollectMaxCountersMessage, to collect the max 
update counter for every affected partition (all partitions for which the 
failed node was primary). No new transaction can reserve a counter at this 
point.
2. Each node that owns a newly assigned primary partition waits for 
countersReadyFuture before finishing the exchange future.
3. All nodes having no such partitions finish the exchange future immediately.
4. On receiving the CollectMaxCountersMessage, the coordinator prepares a 
CountersReadyMessage and sends it directly to all affected nodes from step 2.
5. Each affected node, on receiving the CountersReadyMessage, applies the max 
counter to its partition and finishes the exchange future.

The algorithm also benefits from "cells".
It can be further improved by integrating max counter collection into 
node-failure processing at the discovery layer. A rough sketch of the 
coordinator-side counter aggregation from step 4 is given below.
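
A minimal sketch of the aggregation the coordinator would perform in step 4, 
assuming the per-node reports arrive as maps of partition to the highest update 
counter observed by that node; PartitionKey and the report layout are 
illustrative assumptions, not existing Ignite internals:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/**
 * Coordinator-side aggregation for the proposed PME-free switch:
 * collects per-partition update counters reported by nodes (step 1)
 * and computes the maximum counter per affected partition (step 4).
 * The result would become the payload of the proposed CountersReadyMessage.
 */
public class MaxCounterAggregator {
    /** Identifies a partition of a particular cache group (illustrative). */
    public static final class PartitionKey {
        final int grpId;
        final int part;

        PartitionKey(int grpId, int part) {
            this.grpId = grpId;
            this.part = part;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof PartitionKey))
                return false;

            PartitionKey k = (PartitionKey)o;

            return grpId == k.grpId && part == k.part;
        }

        @Override public int hashCode() {
            return Objects.hash(grpId, part);
        }
    }

    /**
     * @param reports Per-node reports: partition -> highest update counter
     *                observed by that node for the affected partitions.
     * @return Maximum update counter per affected partition.
     */
    public Map<PartitionKey, Long> aggregate(List<Map<PartitionKey, Long>> reports) {
        Map<PartitionKey, Long> maxCntrs = new HashMap<>();

        for (Map<PartitionKey, Long> report : reports) {
            for (Map.Entry<PartitionKey, Long> e : report.entrySet())
                maxCntrs.merge(e.getKey(), e.getValue(), Math::max);
        }

        return maxCntrs;
    }
}
{code}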







> PME-free switch should wait for recovery only at affected nodes.
> ----------------------------------------------------------------
>
>                 Key: IGNITE-12617
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12617
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations are allowed immediately after 
> cluster-wide recovery finishes.
> But is there any reason to wait for cluster-wide recovery if only one node 
> has failed?
> In this case, we only need to recover the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread a node's 
> backup partitions across the whole cluster. In that case we obviously have to 
> wait for cluster-wide recovery on switch.
> But what if only a few nodes act as backups for each primary?
> If nodes are combined into virtual cells where, for each partition, the 
> backups are located in the same cell as the primary, it's possible to finish 
> the switch outside the affected cell before tx recovery finishes.
> This optimization will allow us to start and even finish new operations 
> outside the failed cell without waiting for the cluster-wide switch to finish 
> (broken cell recovery).
> In other words, the switch (node left/fail + in baseline + rebalanced) will 
> have little effect on the latency of operations not related to the failed 
> cell.
> To summarize:
> - We should wait for tx recovery before finishing the switch only on the 
> broken cell.
> - We should wait for replicated caches' tx recovery everywhere, since every 
> node is a backup of the failed one.
> - Upcoming operations related to the broken cell (including all replicated 
> cache operations) will require the cluster-wide switch to finish before they 
> can be processed.
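
For illustration, the "cell" layout described above can be approximated with 
the stock {{RendezvousAffinityFunction}} by restricting backups to nodes that 
share a user attribute with the primary. A minimal configuration sketch, 
assuming a user attribute named CELL on every node (the attribute name, value 
and cache name are illustrative):

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Objects;

import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.lang.IgniteBiPredicate;

public class CellAffinityConfig {
    public static void main(String[] args) {
        // Tag this node with a cell id; all nodes of one cell use the same value.
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setUserAttributes(Collections.singletonMap("CELL", "cell-1"));

        // Affinity backup filter: a candidate backup is acceptable only if it
        // is in the same cell as the primary (first previously selected node).
        IgniteBiPredicate<ClusterNode, List<ClusterNode>> sameCell =
            (candidate, selected) ->
                Objects.equals(candidate.attribute("CELL"), selected.get(0).attribute("CELL"));

        RendezvousAffinityFunction aff = new RendezvousAffinityFunction(false, 1024);
        aff.setAffinityBackupFilter(sameCell);

        CacheConfiguration<Integer, String> cacheCfg =
            new CacheConfiguration<Integer, String>("tx-cache")
                .setBackups(1)
                .setAffinity(aff);

        cfg.setCacheConfiguration(cacheCfg);

        Ignition.start(cfg);
    }
}
{code}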



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
