[jira] [Commented] (IGNITE-23753) Replica's state doesn't change on node stopping

Mikhail Efremov (Jira) Tue, 10 Dec 2024 02:27:04 -0800


    [ 
https://issues.apache.org/jira/browse/IGNITE-23753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904446#comment-17904446
 ]


Mikhail Efremov commented on IGNITE-23753:
------------------------------------------

The ticket is closed because, while working on it, several important issues 
and bugs were found inside {{ReplicaStateManager}}, such as:
* Replica stopping states are not covered on table destruction and node 
stopping.
* A memory leak: some states aren't removed from the context map even after 
the replica is stopped.
* A redundant state creation-deletion process runs for every replica under 
rebalance, even if the node won't create the replica.
* There are no tests, only several strict in-code assertions without proper 
descriptions, so it's hard to make changes and fixes there.

So a refactoring should be done; the corresponding epic IGNITE-23931 has been 
created.
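
As a rough illustration of the memory leak and redundant-state points above, 
here is a minimal sketch, assuming a simplified model in which per-replica 
contexts live in a map keyed by a group id. All names in it 
({{ReplicaContextRegistrySketch}}, {{onAssignmentsChanged}}, 
{{onReplicaStopped}}) are hypothetical and are not the actual 
{{ReplicaStateManager}} API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model; NOT the actual ReplicaStateManager code.
enum ContextStateSketch { ASSIGNED }

final class ReplicaContextRegistrySketch {
    // Per-replica context, keyed by a hypothetical replication group id.
    private final Map<String, ContextStateSketch> contexts = new ConcurrentHashMap<>();

    /** Creates a context only if this node will actually host the replica. */
    void onAssignmentsChanged(String groupId, boolean nodeHostsReplica) {
        if (nodeHostsReplica) {
            contexts.putIfAbsent(groupId, ContextStateSketch.ASSIGNED);
        }
    }

    /** Drops the context once the replica is stopped; returns true if one was tracked. */
    boolean onReplicaStopped(String groupId) {
        return contexts.remove(groupId) != null;
    }

    /** Number of contexts currently held, useful for leak checks in tests. */
    int trackedContexts() {
        return contexts.size();
    }
}
```

The idea is that a context is never created for a replica the node will not 
host, and every stop removes the entry, so the map size stays bounded by the 
number of live replicas.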

> Replica's state doesn't change on node stopping
> -----------------------------------------------
>
>                 Key: IGNITE-23753
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23753
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Efremov
>            Assignee: Mikhail Efremov
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Description*
> It is possible that, while an Ignite node is being stopped, 
> {{TableManager}} has already destroyed all partitions and replicas, but 
> {{ReplicaManager}} hasn't even taken a busy lock yet, and e.g. 
> {{ReplicaStateManager}} after IGNITE-22036 may attempt to get an already 
> removed replica future. For now we may ignore such a {{null}} value, but 
> it's clear that the picture is wrong: we have already deleted the replica, 
> but its state isn't {{STOPPED}}.
> Also, there are spaghetti-like method calls between {{TableManager}} and 
> {{ReplicaManager}} in the replica stopping process: {{TableManager}} calls 
> {{ReplicaManager#weakReplicaStop}}, which calls the given lambda, which is 
> actually {{TableManager#stopAndDestroyPartition}}, which in the end calls 
> {{ReplicaManager#stopReplicaInternal}}. Probably we should separate this code.
> *Motivation*
> In the future we may have logic that gets replicas, and it might observe an 
> inconsistent replica state while a node is stopping.
> *Definition of Done*
> * A new {{WeakStopReason.SHUTDOWN}} is added and used for the node stopping 
> situation in {{TableManager}}.
> * In case of the {{SHUTDOWN}} reason, we must set the {{STOPPED}} replica 
> state and stop the replica immediately.
> * (optional) Consider refactoring 
> {{TableManager#stopAndDestroyPartition}} in a more loosely coupled way.
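
The first two items of the Definition of Done could look roughly like the 
sketch below. It is only an illustration under stated assumptions: 
{{StopReasonSketch}}, {{ReplicaStateSketch}} and {{WeakStopSketch}} are 
hypothetical stand-ins, the {{OTHER}} reason is a placeholder for whatever 
reasons already exist, and the real {{ReplicaManager#weakReplicaStop}} 
signature is not reproduced here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the real Ignite types; names are assumptions.
enum StopReasonSketch { OTHER, SHUTDOWN }

enum ReplicaStateSketch { ASSIGNED, STOPPING, STOPPED }

final class WeakStopSketch {
    private final Map<String, ReplicaStateSketch> states = new ConcurrentHashMap<>();

    /** Starts tracking a replica in the ASSIGNED state. */
    void register(String groupId) {
        states.put(groupId, ReplicaStateSketch.ASSIGNED);
    }

    /**
     * On SHUTDOWN the state is set to STOPPED and the stop action runs right away.
     * For any other reason this sketch only marks the replica as STOPPING and
     * leaves the actual stop to the caller. Returns true if the stop ran immediately.
     */
    boolean weakStop(String groupId, StopReasonSketch reason, Runnable stopAction) {
        if (reason == StopReasonSketch.SHUTDOWN) {
            states.put(groupId, ReplicaStateSketch.STOPPED);
            stopAction.run();
            return true;
        }
        states.put(groupId, ReplicaStateSketch.STOPPING);
        return false;
    }

    /** Current state of the replica, or null if it is not tracked. */
    ReplicaStateSketch stateOf(String groupId) {
        return states.get(groupId);
    }
}
```

With this shape, any code that reads the replica state during node stop sees 
{{STOPPED}} rather than a stale state paired with an already removed replica.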



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
