[
https://issues.apache.org/jira/browse/IGNITE-20828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-20828:
---------------------------------------
Description:
When TopologyAwareRaftGroupService is shut down, it tries to unsubscribe itself
from all peers. If the unsubscription fails, it tries to get the logical
topology (calling the CMG leader with RAFT), check that the target node is
still in the topology, and if yes, retry the unsubscription request. So, if the
CMG leader has already left the topology, an attempt to check the logical
topology will take 10 seconds. This makes the partition stop in TableManager
time out (as it has a limit of 10 seconds), which in turn results in a partition
group staying registered with Loza even after TableManager#stop() returns,
which causes Loza#stop() to fail the Ignite node stop procedure (leaving
HTTP(S) ports bound).
It seems that it makes no sense to retry unsubscription requests at all.
was:
When TopologyAwareRaftGroupService is shut down, it tries to unsubscribe itself
from all peers. If the unsubscription fails, it tries to get the logical
topology (calling the CMG leader with RAFT), check that the target node is
still in the topology, and if yes, retry the unsubscription request. So, if the
CMG leader has already left the topology, an attempt to check the logical
topology will take 10 seconds. This makes the partition stop in TableManager
time out (as it has a limit of 10 seconds), which in turn results in a partition
group staying registered with Loza even after TableManager#stop() returns,
which causes Loza#stop() to fail the Ignite node stop procedure (leaving
HTTP(S) ports bound).
It seems that it makes no sense to retry unsubscription requests at all. Even
more, subscription requests should not be retried either (instead, propagating
the exception right away). The difference between the scenarios should be that
for unsubscription an exception should never be propagated (if it's not an
Error).
> Do not retry attempts to unsubscribe in TopologyAwareRaftGroupService
> ---------------------------------------------------------------------
>
> Key: IGNITE-20828
> URL: https://issues.apache.org/jira/browse/IGNITE-20828
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When TopologyAwareRaftGroupService is shut down, it tries to unsubscribe
> itself from all peers. If the unsubscription fails, it tries to get the
> logical topology (calling the CMG leader with RAFT), check that the target
> node is still in the topology, and if yes, retry the unsubscription request.
> So, if the CMG leader has already left the topology, an attempt to check the
> logical topology will take 10 seconds. This makes the partition stop in
> TableManager time out (as it has a limit of 10 seconds), which in turn results
> in a partition group staying registered with Loza even after
> TableManager#stop() returns, which causes Loza#stop() to fail the Ignite node
> stop procedure (leaving HTTP(S) ports bound).
> It seems that it makes no sense to retry unsubscription requests at all.
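
The proposed behaviour could be sketched roughly as below. This is only an
illustration, not the actual Ignite code: the class UnsubscribeSketch, the
methods unsubscribeFromAll and unsubscribeOnce, and the sendUnsubscribe
callback are hypothetical names, and the choice to swallow non-Error failures
is an assumption taken from the earlier revision of the description. Each peer
gets exactly one unsubscription request, and a failure never triggers a
logical topology lookup or a retry.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;

/** Hypothetical sketch: one unsubscription attempt per peer, no retries. */
class UnsubscribeSketch {
    /** Sends a single unsubscription request to every peer and waits for all of them. */
    static <P> CompletableFuture<Void> unsubscribeFromAll(
            List<P> peers, Function<P, CompletableFuture<Void>> sendUnsubscribe) {
        CompletableFuture<?>[] futures = peers.stream()
                .map(peer -> unsubscribeOnce(peer, sendUnsubscribe))
                .toArray(CompletableFuture[]::new);

        return CompletableFuture.allOf(futures);
    }

    /** Makes one attempt; ordinary exceptions are swallowed, Errors are rethrown. */
    static <P> CompletableFuture<Void> unsubscribeOnce(
            P peer, Function<P, CompletableFuture<Void>> sendUnsubscribe) {
        CompletableFuture<Void> request;
        try {
            request = sendUnsubscribe.apply(peer);
        } catch (Exception e) {
            // The send itself failed synchronously: give up on this peer, no retry.
            return CompletableFuture.completedFuture(null);
        }

        return request.<Void>handle((res, ex) -> {
            Throwable cause = unwrap(ex);
            if (cause instanceof Error) {
                // Assumption: only Errors are worth propagating during shutdown.
                throw (Error) cause;
            }
            // Success or an ordinary failure: either way we are done with this peer.
            return null;
        });
    }

    /** Strips CompletionException wrappers to get at the original failure. */
    private static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }
}
```

Because no failure path calls back into the CMG, shutdown time in this sketch
is bounded by the unsubscription requests themselves rather than by a logical
topology lookup that may take 10 seconds.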