[ 
https://issues.apache.org/jira/browse/IGNITE-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313867#comment-17313867
 ] 

Mikhail Petrov commented on IGNITE-14301:
-----------------------------------------

It seems that https://issues.apache.org/jira/browse/IGNITE-14375 fixes this 
issue.

> Authentication processor can hang all user management operation after server 
> node reconnect
> -------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-14301
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14301
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Priority: Major
>
> First of all, look at the test 
> AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer,
> which is flaky: [TC 
> history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]
> The first problem with this test is that user management operations 
> (add/update/remove) create too many discovery messages, so the discovery 
> custom message history is too small to properly skip duplicated custom 
> messages that can be sent across the ring during a server node reconnect. 
> This leads to the mentioned test failures due to duplication of user 
> management operations (see GridDiscoveryManager#discoCacheHist, the 
> IGNITE_DISCOVERY_HISTORY_SIZE system property, and 
> ServerImpl.RingMessageWorker#sendMessageAcrossRing).
> If the discovery history size is increased significantly, the test stops 
> failing and starts hanging instead. The steps that lead to the hang:
>  1. A client node sends UserProposedMessage across the ring while one node is 
> offline due to a reconnect. 
>  2. Alive server nodes update their local user lists and finish the 
> operation. 
>  3. The reconnected node joins the ring and receives an updated user list from 
> the coordinator.
>  4. The reconnected node receives a duplicated UserProposedMessage that has 
> already been handled by all nodes, handles it, sends 
> UserManagementOperationFinishedMessage to the coordinator, and starts waiting 
> for the UserAcceptedMessage from it. But the coordinator has already finished 
> this operation, so the thread that is responsible for user management 
> operations on the reconnected node becomes blocked (see 
> IgniteAuthenticationProcessor.UserOperationWorker#body).
>  5. The client node starts the next operation, which needs all alive nodes to 
> respond with UserManagementOperationFinishedMessage. But the authentication 
> thread of the reconnected node is blocked, so this operation can never be 
> completed.
> This issue causes all tests in the AuthenticationProcessorNodeRestartTest 
> test class to be flaky.
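
For illustration, the history overflow described in the first part of the issue can be modelled with a small sketch. This is not Ignite code: the class BoundedMessageHistory and its register method are hypothetical names invented here, and the real de-duplication logic (see ServerImpl.RingMessageWorker#sendMessageAcrossRing) may be organized quite differently. The sketch only assumes a fixed-size history of message ids from which the oldest entry is evicted when the history is full.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical bounded de-duplication history (illustration only,
 * not Ignite's actual implementation).
 */
final class BoundedMessageHistory {
    /** Maximum number of message ids remembered at once. */
    private final int capacity;

    /** Remembered ids in arrival order, oldest first. */
    private final Deque<Long> order = new ArrayDeque<>();

    /** Same ids as in {@code order}, kept for fast lookup. */
    private final Set<Long> seen = new HashSet<>();

    BoundedMessageHistory(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");

        this.capacity = capacity;
    }

    /**
     * @param msgId Id of the received message.
     * @return {@code true} if the message looks new and should be processed,
     *      {@code false} if it is still in the history and can be skipped.
     */
    boolean register(long msgId) {
        if (seen.contains(msgId))
            return false;

        // History is full: forget the oldest id to make room.
        if (order.size() == capacity)
            seen.remove(order.removeFirst());

        order.addLast(msgId);
        seen.add(msgId);

        return true;
    }
}
```

In this model, a history of capacity 2 that has seen ids 1, 2 and 3 has already evicted id 1, so a re-sent message with id 1 is accepted and processed a second time. With a larger capacity the same re-sent message is recognized and skipped, which corresponds to the effect of increasing the discovery history size.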



