[ 
https://issues.apache.org/jira/browse/IGNITE-23708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Efremov updated IGNITE-23708:
-------------------------------------
    Description: 
*Description*

During IGNITE-22036 an issue was found with the
{{ItDisasterRecoveryReconfigurationTest#testAutomaticRebalanceIfMajorityIsLost}}
test. It is flaky in two possible places:

# On the {{assertRealAssignments(node0, partId, 1)}} check the actual result is 
{{[0, 1, 2]}}. The reason is that the previous rebalance, triggered by the 
scale-down timer and left unfinished as expected (the majority of {{[1, 3, 4]}} 
is lost after nodes 3 and 4 are stopped), starts new replication groups for the 
corresponding partition, and the following force-reset rebalance with {{[1]}} 
couldn't finish before the assertion.
# On {{assertNull(getPendingAssignments(node0, partId))}} we could still have 
non-null pending assignments, because there were non-forced planned assignments 
equal to {{[1]}} from the partition reset, and even if the force reset from 
p. 1 is done, the non-forced assignment rebalance may also be slightly late.

The solution for 1 is simply to increase the timeout inside 
{{assertRealAssignments}}. The solution for 2 is to check whether the reset 
assignments and the planned assignments are equal, and if so, leave the latter 
as {{null}}, because there is no need for a de facto identical rebalance. The 
test highlights this drawback in the implementation, and it should be fixed.

*Motivation*

There shouldn't be any flaky tests; besides, the implementation is flawed and 
should be fixed.

*Definition of Done*

# {{assertRealAssignments}}'s timeout is increased from 2000ms up to 5000ms.
# Inside {{GroupUpdateRequest#partitionUpdate}}, for the case when nodes are 
alive, we should check whether {{partAssignments == stableAssignments}} and, if 
so, put {{null}} as the planned assignments instead of {{partAssignments}}.
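The check from item 2 of the Definition of Done can be sketched as below. This is only an illustration of the idea, not the actual {{GroupUpdateRequest#partitionUpdate}} code: the class and method names here are hypothetical, and real assignments in Ignite 3 are richer objects than plain node-name sets.

```java
import java.util.Set;

// Hedged sketch of the Definition of Done, item 2. All names are illustrative;
// the real logic belongs in GroupUpdateRequest#partitionUpdate.
public class PlannedAssignmentsSketch {
    /**
     * Returns the value to store as the planned assignments: {@code null} when
     * the freshly computed assignments already equal the current stable ones
     * (scheduling a de facto identical rebalance would be pointless),
     * otherwise the computed assignments themselves.
     */
    static Set<String> plannedValue(Set<String> partAssignments, Set<String> stableAssignments) {
        if (partAssignments != null && partAssignments.equals(stableAssignments)) {
            return null; // skip scheduling a duplicate rebalance
        }
        return partAssignments;
    }
}
```

With this check in place, the non-forced planned rebalance equal to {{[1]}} from the partition reset would never be written, so {{getPendingAssignments(node0, partId)}} could not observe a leftover duplicate.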


  was:
*Description*

During IGNITE-22036 an issue was found with the
=ItDisasterRecoveryReconfigurationTest#testAutomaticRebalanceIfMajorityIsLost= 
test. It is flaky in two possible places:

1. On the =assertRealAssignments(node0, partId, 1)= check the actual result is 
   =[0, 1, 2]=. The reason is that the previous rebalance, triggered by the 
   scale-down timer and left unfinished as expected (the majority of 
   =[1, 3, 4]= is lost after nodes 3 and 4 are stopped), starts new 
   replication groups for the corresponding partition, and the following 
   force-reset rebalance with =[1]= couldn't finish before the assertion.
2. On =assertNull(getPendingAssignments(node0, partId))= we could still have 
   non-null pending assignments, because there were non-forced planned 
   assignments equal to =[1]= from the partition reset, and even if the force 
   reset from p. 1 is done, the non-forced assignment rebalance may also be 
   slightly late.

The solution for 1 is simply to increase the timeout inside 
=assertRealAssignments=. The solution for 2 is to check whether the reset 
assignments and the planned assignments are equal, and if so, leave the latter 
as =null=, because there is no need for a de facto identical rebalance. The 
test highlights this drawback in the implementation and it should be fixed.

*Motivation*

There shouldn't be any flaky tests; besides, the implementation is flawed and 
should be fixed.

*Definition of Done*

1. =assertRealAssignments='s timeout is increased from 2000ms up to 5000ms.
2. Inside =GroupUpdateRequest#partitionUpdate=, for the case when nodes are 
   alive, we should check whether =partAssignments == stableAssignments= and, 
   if so, put =null= as the planned assignments instead of =partAssignments=.
  



> testAutomaticRebalanceIfMajorityIsLost is flaky
> -----------------------------------------------
>
>                 Key: IGNITE-23708
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23708
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Efremov
>            Assignee: Mikhail Efremov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
