[ 
https://issues.apache.org/jira/browse/IGNITE-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Petrov updated IGNITE-14301:
------------------------------------
    Description: 
First of all, look at the test 
AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer,
 which is flaky - [TC 
history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]

The first problem with this test is that user management 
operations (add/update/remove) create too many discovery messages, so the 
discovery custom message history is too small to properly skip duplicated 
custom messages that can be sent across the ring during a server node 
reconnect. This leads to the mentioned test failures, because user management 
operations are duplicated (see GridDiscoveryManager#discoCacheHist, 
the IGNITE_DISCOVERY_HISTORY_SIZE system property, and 
ServerImpl.RingMessageWorker#sendMessageAcrossRing).
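
To illustrate why a too-small history lets a duplicate through, here is a 
minimal sketch of a bounded "seen message IDs" history. It is not Ignite's 
actual implementation: the names BoundedHistory, register, and timesProcessed 
are hypothetical, and the FIFO eviction policy is an assumption made only for 
illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

final class BoundedHistoryDemo {
    /** Hypothetical bounded history of seen message IDs (not an Ignite class). */
    static final class BoundedHistory {
        private final int capacity;
        private final Deque<Long> order = new ArrayDeque<>();
        private final Set<Long> seen = new HashSet<>();

        BoundedHistory(int capacity) {
            this.capacity = capacity;
        }

        /** Returns true if the ID is not in the history, false if it is a known duplicate. */
        boolean register(long msgId) {
            if (seen.contains(msgId))
                return false;

            seen.add(msgId);
            order.addLast(msgId);

            // Assumed FIFO eviction: once full, the oldest ID is forgotten.
            if (order.size() > capacity)
                seen.remove(order.removeFirst());

            return true;
        }
    }

    /**
     * Delivers message 0, then laterMessages further messages, then message 0 again.
     * Returns how many times message 0 was treated as new (that is, processed).
     */
    static int timesProcessed(int capacity, int laterMessages) {
        BoundedHistory hist = new BoundedHistory(capacity);
        int processed = 0;

        if (hist.register(0L))
            processed++;

        for (long id = 1; id <= laterMessages; id++)
            hist.register(id);

        // Redelivery of message 0, for example during a node reconnect.
        if (hist.register(0L))
            processed++;

        return processed;
    }

    public static void main(String[] args) {
        System.out.println(timesProcessed(100, 10)); // history large enough: prints 1
        System.out.println(timesProcessed(5, 10));   // history too small: prints 2
    }
}
```

In this model, the redelivered message is skipped only while its ID is still 
in the history. Once enough later messages have pushed it out, the duplicate 
is processed a second time.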

If the discovery history size is increased significantly, the test stops 
failing and starts hanging. The steps that lead to this:
 1. A client node sends UserProposedMessage across the ring while one node is 
offline due to a reconnect. 
 2. The alive server nodes update their local user lists and finish the 
operation. 
 3. The reconnected node joins the ring and receives an updated user list from 
the coordinator.
 4. The reconnected node receives a duplicated UserProposedMessage that has 
already been handled by all nodes, handles it, sends 
UserManagementOperationFinishedMessage to the coordinator, and starts waiting 
for the UserAcceptedMessage from it. But the coordinator has already finished 
this operation, so the thread that is responsible for user management 
operations on the reconnected node becomes blocked (see 
IgniteAuthenticationProcessor.UserOperationWorker#body).
 5. The client node starts the next operation, which needs all alive nodes to 
respond with UserManagementOperationFinishedMessage. But the reconnected 
node's authentication thread is blocked, so this operation can never complete.
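
The hang in steps 4 and 5 can be modeled with a single worker thread that 
waits for an acknowledgement that never arrives. This is only a simplified 
sketch, not Ignite code: BlockedWorkerDemo, nextOperationCompletes, and the 
latch standing in for UserAcceptedMessage are all hypothetical, and it 
assumes that one thread handles operations strictly one after another.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BlockedWorkerDemo {
    /**
     * Models a node whose single worker thread handles operations one by one.
     * Returns true if the second operation finishes within timeoutMs.
     */
    static boolean nextOperationCompletes(boolean acceptArrives, long timeoutMs) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Stands in for the acceptance message the worker waits for.
        CountDownLatch accepted = new CountDownLatch(1);

        try {
            // Operation 1: the duplicated proposal. The worker waits for acceptance.
            worker.submit(() -> {
                accepted.await();
                return null;
            });

            // If the coordinator has already finished the operation, this never happens.
            if (acceptArrives)
                accepted.countDown();

            // Operation 2: the next user management operation, queued behind operation 1.
            Future<String> next = worker.submit(() -> "finished");

            try {
                next.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            }
            catch (TimeoutException e) {
                return false; // The worker is still stuck in operation 1.
            }
            catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        finally {
            // Interrupts the waiting worker so the demo can exit.
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(nextOperationCompletes(true, 2000));  // acceptance arrives
        System.out.println(nextOperationCompletes(false, 200));  // acceptance never arrives
    }
}
```

When the acceptance never arrives, the second operation stays queued behind 
the first one and times out, which mirrors how one blocked node prevents every 
later operation from collecting responses from all alive nodes.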

This issue causes all tests in the AuthenticationProcessorNodeRestartTest test 
class to be flaky.


> Authentication processor can hang all user management operations after server 
> node reconnect
> -------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-14301
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14301
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
