[jira] [Issue Comment Deleted] (IGNITE-12617) PME-free switch should wait for recovery only at affected nodes.

Anton Vinogradov (Jira) Wed, 06 May 2020 00:26:36 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anton Vinogradov updated IGNITE-12617:
--------------------------------------
    Comment: was deleted

(was: {panel:title=Branch: [pull/7490/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=5270941&buildTypeId=IgniteTests24Java8_RunAll])

> PME-free switch should wait for recovery only at affected nodes.
> ----------------------------------------------------------------
>
>                 Key: IGNITE-12617
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12617
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations are allowed immediately after 
> cluster-wide recovery has finished.
> But is there any reason to wait for a cluster-wide recovery if only one node 
> failed?
> In this case, we should recover only the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread a node's 
> backup partitions across the whole cluster. In that case we obviously have to 
> wait for cluster-wide recovery on switch.
> But what if only some nodes are the backups for each primary?
> If nodes are combined into virtual cells where, for each partition, the 
> backups are located in the same cell as the primary, it is possible to finish 
> the switch outside the affected cell before tx recovery finishes.
> This optimization will allow us to start and even finish new operations 
> outside the failed cell without waiting for the cluster-wide switch to finish 
> (that is, for recovery of the broken cell).
> In other words, the switch (on node left/fail + baseline + rebalanced) will 
> have little effect on the latency of operations not related to the failed cell.
> To summarize:
> - We should wait for tx recovery before finishing the switch only in the 
> broken cell.
> - We should wait for tx recovery of replicated caches everywhere, since every 
> node is a backup of the failed one.
> - Upcoming operations related to the broken cell (including all operations on 
> replicated caches) will require the cluster-wide switch to finish before they 
> can be processed.
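
The three rules above can be sketched as a small decision function. This is only an illustration under stated assumptions, not Ignite code: the class CellSwitchSketch, the method mustWaitForRecovery and the node-to-cell map are hypothetical names, and the sketch assumes that every node has exactly one known cell and that each cache is either replicated or has all backups inside the primary's cell.

```java
import java.util.Map;

/** Hypothetical illustration; none of these names are Ignite APIs. */
final class CellSwitchSketch {
    private CellSwitchSketch() {}

    /**
     * Decides whether a surviving node has to wait for tx recovery of a
     * given cache before finishing the switch.
     *
     * @param node              id of the surviving node
     * @param failedNode        id of the node that left or failed
     * @param cellOf            assumed mapping: node id -> virtual cell id
     * @param cacheIsReplicated true if the cache in question is replicated
     */
    static boolean mustWaitForRecovery(String node,
                                       String failedNode,
                                       Map<String, String> cellOf,
                                       boolean cacheIsReplicated) {
        // Replicated caches: every node is a backup of the failed one,
        // so recovery has to be awaited everywhere.
        if (cacheIsReplicated)
            return true;

        // Cell-local caches: only nodes sharing the failed node's cell
        // can hold its backups, so only they have to wait.
        String failedCell = cellOf.get(failedNode);
        return failedCell != null && failedCell.equals(cellOf.get(node));
    }
}
```

With cells {A, B} and {C, D} and a failure of A, this sketch makes B wait for recovery, while C and D may finish the switch right away unless the cache is replicated.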



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
