[ https://issues.apache.org/jira/browse/IGNITE-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313867#comment-17313867 ]
Mikhail Petrov commented on IGNITE-14301:
-----------------------------------------

It seems that https://issues.apache.org/jira/browse/IGNITE-14375 fixes this issue.

> Authentication processor can hang all user management operation after server node reconnect
> -------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-14301
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14301
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Priority: Major
>
> First of all, look at the test
> AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer,
> which is flaky - [TC history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]
>
> The first problem with this test is that user management operations
> (add/update/remove) create too many discovery messages. As a result, the
> discovery custom message history is too small to properly skip duplicated
> custom messages that can be sent across the ring during server node
> reconnect. This leads to the mentioned test failures due to duplication of
> user management operations (see GridDiscoveryManager#discoCacheHist, the
> IGNITE_DISCOVERY_HISTORY_SIZE system property, and
> ServerImpl.RingMessageWorker#sendMessageAcrossRing).
>
> If the discovery history size is increased significantly, the test stops
> failing and starts hanging. The steps that lead to this:
> 1. A client node sends UserProposedMessage across the ring while one node is
> offline due to reconnect.
> 2. Alive server nodes update their local user lists and finish the
> operation.
> 3. The reconnected node joins the ring and receives an updated user list
> from the coordinator.
> 4. The reconnected node receives a duplicated UserProposedMessage that has
> already been handled by all nodes, handles it, sends
> UserManagementOperationFinishedMessage to the coordinator, and starts to
> wait for the UserAcceptedMessage from it. But the coordinator has already
> finished this operation, so the thread that is responsible for user
> management operations on the reconnected node becomes blocked (see
> IgniteAuthenticationProcessor.UserOperationWorker#body).
> 5. The client node starts the next operation, which needs all alive nodes to
> respond with UserManagementOperationFinishedMessage. But the authentication
> thread of the reconnected node is blocked, so this operation can never be
> completed.
>
> This issue causes all tests in the AuthenticationProcessorNodeRestartTest
> test class to be flaky.
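
To illustrate the first problem from the description: a minimal sketch, assuming a simple FIFO-bounded history of message IDs, of how a duplicate stops being recognized once enough newer messages have pushed the original out of the history. The class BoundedMessageHistory and its methods are hypothetical and are not Ignite's actual discovery code; they only model the general effect of a history that is too small.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical bounded history of message IDs; NOT Ignite's implementation. */
final class BoundedMessageHistory {
    private final int capacity;
    private final Deque<Long> order = new ArrayDeque<>();
    private final Set<Long> seen = new HashSet<>();

    BoundedMessageHistory(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /**
     * Registers a message ID.
     * Returns true if the message looks new and would be processed,
     * false if it is still in the history and is skipped as a duplicate.
     */
    boolean register(long msgId) {
        if (seen.contains(msgId))
            return false;
        // Evict the oldest ID when the history is full.
        if (order.size() == capacity)
            seen.remove(order.removeFirst());
        order.addLast(msgId);
        seen.add(msgId);
        return true;
    }

    /**
     * Sends messages 0..total-1, then resends message 0 (as could happen
     * during a node reconnect). Returns how many times message 0 was processed.
     */
    static int timesFirstMessageProcessed(int capacity, int total) {
        BoundedMessageHistory hist = new BoundedMessageHistory(capacity);
        int processed = 0;
        for (long id = 0; id < total; id++) {
            if (hist.register(id) && id == 0)
                processed++;
        }
        // Duplicate delivery of message 0.
        if (hist.register(0))
            processed++;
        return processed;
    }

    public static void main(String[] args) {
        // History large enough: the duplicate is skipped.
        System.out.println(timesFirstMessageProcessed(100, 10)); // prints 1
        // History too small: the duplicate is processed a second time.
        System.out.println(timesFirstMessageProcessed(5, 10));   // prints 2
    }
}
```

In this model, enlarging the history removes the double processing, which matches the observation that the test stops failing once the history size is increased. It does not model the hang described in steps 1-5, where the reconnected node was offline and never saw the original message.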