zpinto commented on PR #2451:
URL: https://github.com/apache/helix/pull/2451#issuecomment-1511686142

   This PR is ready to merge, approved by @xyuanlu and @qqu0127
   
   Commit message:
   Fixing Flaky Integration Test TestClusterAccessor#testClusterFreezeMode
   
   What is this test doing?
   
   This integration test makes a POST request to helix-rest to place 
cluster TestCluster_0 into CLUSTER_FREEZE mode. It then verifies that a 
subsequent GET request for clusterStatus reports CLUSTER_FREEZE mode.
   
   Why is this test likely flaky?
   
   Following the POST request, we check that the pauseSignal we added 
exists under the CONTROLLER znode. This check always succeeds because the 
znode is written before the POST responds.
   
   Next, the test checks that the clusterMode is CLUSTER_FREEZE. We also 
check that the clusterStatus is either IN_PROGRESS or COMPLETED (meaning all 
state transitions are either canceled or completed and participants are now 
frozen/paused). The GET request can sometimes return clusterMode == NORMAL 
because the verification we do before the GET request is not a definitive 
signal that the pause signal has changed the clusterMode to CLUSTER_FREEZE. 
A cluster pause event has to be enqueued to the _managementModeEventQueue 
and processed before we make the GET for clusterStatus. Until that event has 
been processed, the clusterMode remains NORMAL.
   
   Previously, this verification only checked that the clusterStatus znode 
existed.
   
   That does not necessarily indicate that the pause event has been 
processed, because an earlier event passed through the 
_managementEventPipeline could have created the znode with clusterMode == 
NORMAL. In that case we would make the GET request too early.
   
   Instead, we will verify that the clusterStatus znode has CLUSTER_FREEZE mode 
before we make the GET request. This will ensure that the pause event was 
already processed.
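   
   The difference between the two checks can be sketched roughly as follows. 
This is a standalone illustration, not the code in this PR: the class 
FreezeWaitSketch, the helper waitUntil, and the AtomicReference standing in 
for the clusterStatus znode are all hypothetical names, and the background 
thread only simulates the controller processing the pause event after the 
POST has returned.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from the Helix code base.
class FreezeWaitSketch {

  // Polls `condition` every `intervalMs` until it holds or `timeoutMs` elapses.
  // Returns true if the condition held, false on timeout or interruption.
  static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Stand-in for the mode stored in the clusterStatus znode. It already
    // exists with NORMAL, as if an earlier event had created it.
    AtomicReference<String> mode = new AtomicReference<>("NORMAL");

    // Stand-in for the controller processing the pause event a little
    // after the POST has returned.
    Thread controller = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      mode.set("CLUSTER_FREEZE");
    });
    controller.start();

    // Weak check (old behavior): passes immediately, while mode is still NORMAL.
    boolean exists = waitUntil(() -> mode.get() != null, 1000, 10);
    String modeAfterWeakCheck = mode.get();

    // Strong check (new behavior): passes only once the mode is CLUSTER_FREEZE.
    boolean frozen = waitUntil(() -> "CLUSTER_FREEZE".equals(mode.get()), 1000, 10);

    System.out.println("exists=" + exists + ", mode after weak check=" + modeAfterWeakCheck);
    System.out.println("frozen=" + frozen + ", mode after strong check=" + mode.get());
  }
}
```

   In this sketch, a GET issued right after the weak check could still 
observe NORMAL, whereas one issued after the strong check cannot.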
   
   - fixes [Failed CI Test] 
testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #2404
   - fixes [Failed CI Test] 
testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #2229
   - fixes [Failed CI Test] 
testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #1979
   - fixes [Failed CI Test] 
testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #1832
   
   Testing Done:
   
   Ran the integration test 10 times:
   ./scripts/runSingleTest.sh TestClusterAccessor#testClusterFreezeMode 10 helix-rest


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

