[
https://issues.apache.org/jira/browse/IGNITE-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313867#comment-17313867
]
Mikhail Petrov commented on IGNITE-14301:
-----------------------------------------
It seems that https://issues.apache.org/jira/browse/IGNITE-14375 fixes this
issue.
> Authentication processor can hang all user management operation after server
> node reconnect
> -------------------------------------------------------------------------------------------
>
> Key: IGNITE-14301
> URL: https://issues.apache.org/jira/browse/IGNITE-14301
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Petrov
> Priority: Major
>
> First of all, look at the test -
> AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer
> which is flaky - [TC
> history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]
> The first problem with this test is that user management operations
> (add/update/remove) create too many discovery messages, so the discovery
> custom message history is too small to properly skip duplicated custom
> messages that can be sent across the ring during a server node reconnect.
> This leads to the mentioned test failures due to duplication of user
> management operations (see GridDiscoveryManager#discoCacheHist, the
> IGNITE_DISCOVERY_HISTORY_SIZE system property, and
> ServerImpl.RingMessageWorker#sendMessageAcrossRing).
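
To illustrate why a history that is too small lets a duplicate through, here is a minimal sketch of a bounded "already seen" history. It is not Ignite's implementation: the class BoundedHistorySketch, its History helper, and the timesProcessed method are hypothetical names, and plain long ids stand in for discovery custom messages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a bounded duplicate-detection history (not Ignite code). */
public class BoundedHistorySketch {
    /** Remembers at most maxSize message ids and evicts the oldest one first. */
    static final class History {
        private final Map<Long, Boolean> seen;

        History(final int maxSize) {
            this.seen = new LinkedHashMap<Long, Boolean>() {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                    return size() > maxSize;
                }
            };
        }

        /** Returns true if the id is not in the history, i.e. the message would be processed. */
        boolean register(long msgId) {
            return seen.put(msgId, Boolean.TRUE) == null;
        }
    }

    /**
     * Counts how often message 1 is processed when it is delivered once,
     * followed by {@code later} other messages, and then delivered again.
     */
    static int timesProcessed(int historySize, int later) {
        History hist = new History(historySize);
        int processed = 0;

        if (hist.register(1L))
            processed++;

        for (long id = 2; id < 2 + later; id++)
            hist.register(id);

        // The same message arrives again, e.g. because it was re-sent across the ring.
        if (hist.register(1L))
            processed++;

        return processed;
    }

    public static void main(String[] args) {
        System.out.println(timesProcessed(100, 10)); // prints 1: the duplicate is skipped
        System.out.println(timesProcessed(5, 10));   // prints 2: the entry was evicted, so the duplicate is processed
    }
}
```

With enough messages between the first delivery and the re-send, the original entry is evicted and the duplicate looks like a new message. This matches the duplication of user management operations described above.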
> If the discovery history size is increased significantly, the test stops
> failing and starts hanging instead. The steps that lead to the hang:
> 1. A client node sends UserProposedMessage across the ring while one node is
> offline due to reconnect.
> 2. The alive server nodes update their local user lists and finish the
> operation.
> 3. The reconnected node joins the ring and receives an updated user list from
> the coordinator.
> 4. The reconnected node receives a duplicated UserProposedMessage that has
> already been handled by all nodes, handles it, sends
> UserManagementOperationFinishedMessage to the coordinator, and starts waiting
> for the UserAcceptedMessage from it. But the coordinator has already finished
> this operation, so the thread that is responsible for user management
> operations on the reconnected node becomes blocked (see
> IgniteAuthenticationProcessor.UserOperationWorker#body).
> 5. The client node starts the next operation, which needs all alive nodes to
> respond with UserManagementOperationFinishedMessage. But the reconnected
> node's authentication thread is blocked, so this operation can never complete.
> This issue causes all tests in the AuthenticationProcessorNodeRestartTest
> test class to be flaky.
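
Step 4 of the quoted description can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the Ignite code: DuplicateProposeHangSketch, Coordinator, and handleProposed are hypothetical names, a CountDownLatch stands in for the wait on UserAcceptedMessage, and a timeout is added only so the example terminates (the description implies the real wait never ends).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a worker waiting for an acknowledgement that is never sent. */
public class DuplicateProposeHangSketch {
    /** Stand-in coordinator: it acknowledges an operation only the first time it completes. */
    static final class Coordinator {
        private final Set<UUID> finished = new HashSet<>();

        /** Marks an operation as completed before the reconnected node reports in. */
        void markFinished(UUID opId) {
            finished.add(opId);
        }

        /** Handles a "finished" notification; releases the waiter only for a still-active operation. */
        void onOperationFinished(UUID opId, CountDownLatch ack) {
            if (!finished.add(opId))
                return; // Operation already completed: no acknowledgement is sent.

            ack.countDown();
        }
    }

    /**
     * Models the worker handling a proposed operation: notify the coordinator,
     * then wait for the acknowledgement. Returns true if it arrived in time.
     */
    static boolean handleProposed(Coordinator crd, UUID opId, long timeoutMs) {
        CountDownLatch ack = new CountDownLatch(1);

        crd.onOperationFinished(opId, ack);

        try {
            // Assumption: without the timeout this wait would block indefinitely.
            return ack.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return false;
        }
    }

    public static void main(String[] args) {
        Coordinator crd = new Coordinator();

        // Normal case: the coordinator still tracks the operation and acknowledges it.
        System.out.println(handleProposed(crd, UUID.randomUUID(), 200)); // prints true

        // Duplicate case: the operation finished while the node was offline.
        UUID oldOp = UUID.randomUUID();
        crd.markFinished(oldOp);
        System.out.println(handleProposed(crd, oldOp, 200)); // prints false
    }
}
```

In the duplicate case the worker stays in the wait until the timeout expires. With an unbounded wait on a single worker thread, every later operation queued behind it would stall, which corresponds to step 5.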
--
This message was sent by Atlassian Jira
(v8.3.4#803005)