[ 
https://issues.apache.org/jira/browse/IGNITE-20828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-20828:
---------------------------------------
    Description: 
When TopologyAwareRaftGroupService is shut down, it tries to unsubscribe itself 
from all peers. If the unsubscription fails, it tries to get the logical 
topology (calling the CMG leader via RAFT), checks that the target node is 
still in the topology, and, if so, retries the unsubscription request. So, if 
the CMG leader has already left the topology, an attempt to check the logical 
topology will take 10 seconds. This makes the partition stop in TableManager 
time out (as it has a limit of 10 seconds), which in turn results in a 
partition group staying registered with Loza even after TableManager#stop() 
returns, which causes Loza#stop() to fail the Ignite node stop procedure 
(leaving HTTP(S) ports bound).

It seems that it makes no sense to retry unsubscription requests at all.
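
A minimal sketch of what "no retry" could look like, assuming each 
unsubscription request is represented by a CompletableFuture. The class name 
BestEffortUnsubscribe, the methods unsubscribeFromAll and unsubscribeOnce, and 
the sendUnsubscribe callback are hypothetical and are not the actual 
TopologyAwareRaftGroupService API. Swallowing the failure is also an 
assumption of the sketch. The point is only that a failed request is never 
followed by a logical topology lookup or a second attempt:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical helper, not part of Ignite: one unsubscription attempt per peer, no retries.
class BestEffortUnsubscribe {
    /** Sends exactly one request to every peer and completes when all attempts have finished. */
    static <P> CompletableFuture<Void> unsubscribeFromAll(
            List<P> peers, Function<P, CompletableFuture<Void>> sendUnsubscribe) {
        CompletableFuture<?>[] attempts = peers.stream()
                .map(peer -> unsubscribeOnce(peer, sendUnsubscribe))
                .toArray(CompletableFuture[]::new);

        return CompletableFuture.allOf(attempts);
    }

    /** A single attempt: a failure is swallowed (assumption) instead of triggering a retry. */
    static <P> CompletableFuture<Void> unsubscribeOnce(
            P peer, Function<P, CompletableFuture<Void>> sendUnsubscribe) {
        CompletableFuture<Void> request;
        try {
            request = sendUnsubscribe.apply(peer);
        } catch (RuntimeException e) {
            // The request could not even be sent; treat it like a failed request.
            return CompletableFuture.completedFuture(null);
        }

        // No topology check and no second request: the attempt is over either way.
        return request.handle((ignoredResult, ignoredFailure) -> null);
    }
}
```

With this shape, the time spent on unsubscription during stop is bounded by a 
single round of requests and does not depend on the CMG leader being 
reachable.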

  was:
When TopologyAwareRaftGroupService is shut down, it tries to unsubscribe itself 
from all peers. If the unsubscription fails, it tries to get the logical 
topology (calling the CMG leader via RAFT), checks that the target node is 
still in the topology, and, if so, retries the unsubscription request. So, if 
the CMG leader has already left the topology, an attempt to check the logical 
topology will take 10 seconds. This makes the partition stop in TableManager 
time out (as it has a limit of 10 seconds), which in turn results in a 
partition group staying registered with Loza even after TableManager#stop() 
returns, which causes Loza#stop() to fail the Ignite node stop procedure 
(leaving HTTP(S) ports bound).

It seems that it makes no sense to retry unsubscription requests at all. 
Moreover, subscription requests should not be retried either (instead, the 
exception should be propagated right away). The difference between the 
scenarios should be that for unsubscription an exception should never be 
propagated (unless it is an Error).


> Do not retry attempts to unsubscribe in TopologyAwareRaftGroupService
> ---------------------------------------------------------------------
>
>                 Key: IGNITE-20828
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20828
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When TopologyAwareRaftGroupService is shut down, it tries to unsubscribe 
> itself from all peers. If the unsubscription fails, it tries to get the 
> logical topology (calling the CMG leader via RAFT), checks that the target 
> node is still in the topology, and, if so, retries the unsubscription 
> request. So, if the CMG leader has already left the topology, an attempt to 
> check the logical topology will take 10 seconds. This makes the partition 
> stop in TableManager time out (as it has a limit of 10 seconds), which in 
> turn results in a partition group staying registered with Loza even after 
> TableManager#stop() returns, which causes Loza#stop() to fail the Ignite node 
> stop procedure (leaving HTTP(S) ports bound).
> It seems that it makes no sense to retry unsubscription requests at all.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
