GrantPSpencer opened a new pull request, #2705:
URL: https://github.com/apache/helix/pull/2705

   ### Issues
   
   - [ ] My PR addresses the following Helix issues and references them in the 
PR description:
   
   #2693 [Failed CI Test] testCacheDataUpdates
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   The metaclient cache relies on ZK watches to populate its data, so there can 
be a lag between when an operation occurs and when it is reflected in the 
cache. testCacheDataUpdates was creating a node with 
`zkMetaClientCache.create(key + DATA_PATH, DATA_VALUE)` and then immediately 
reading it back with `zkMetaClientCache.get(key + DATA_PATH)`. When the watch 
had not yet fired, this get() call returned null (so data = null), and the 
following assertion,
   `Assert.assertEquals(data, zkMetaClientCache.getDataCacheMap().get(key + 
DATA_PATH))`, still passed because the value had not been populated in the 
DataCacheMap either, so it evaluated to `assertEquals(null, null)`.
   
   The next assertion could then fail because it compared the stale `data` value 
of null against the value in the cache. If the cache had been updated by then, 
the assertion failed; if it had not, the assertion passed, which explains the 
flakiness.
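   
   The race described above can be illustrated with a minimal, deterministic 
sketch. `LaggingCacheDemo` and its `deliverWatchEvents()` method are 
hypothetical stand-ins, not Helix classes: writes are queued and only reach the 
cache when the simulated watch events are delivered, which mimics the lag 
without depending on timing.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Queue;

// Hypothetical stand-in for a watch-populated cache; NOT the Helix ZkMetaClientCache.
public class LaggingCacheDemo {
  private final Map<String, String> cache = new HashMap<>();
  private final Queue<String[]> pendingEvents = new ArrayDeque<>();

  // The write is accepted, but the cache is not updated until the watch event arrives.
  public void create(String key, String value) {
    pendingEvents.add(new String[] {key, value});
  }

  // Simulates the watch callbacks firing and populating the cache.
  public void deliverWatchEvents() {
    String[] event;
    while ((event = pendingEvents.poll()) != null) {
      cache.put(event[0], event[1]);
    }
  }

  public String get(String key) {
    return cache.get(key);
  }

  public static void main(String[] args) {
    LaggingCacheDemo demo = new LaggingCacheDemo();
    demo.create("/key/data", "value");

    // Read before the watch event is delivered: the cache has no entry yet.
    String data = demo.get("/key/data");

    // Both sides are null, so the comparison succeeds without checking anything.
    System.out.println(Objects.equals(data, demo.get("/key/data"))); // true

    demo.deliverWatchEvents();

    // The cache now holds "value", but the stale local copy is still null.
    System.out.println(Objects.equals(data, demo.get("/key/data"))); // false
  }
}
```

   Whether the second comparison runs before or after the watch event is what 
decides the outcome, which matches the intermittent failures seen in CI.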
   
   The first assertion has been changed to also use the 
MetaClientTestUtil.verify() method, which repeatedly checks until a timeout, 
giving the cache time to update. 
   Both assertions now expect DATA_VALUE as the znode value, so they no longer 
compare against a possibly stale value. 
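   
   As a rough sketch of the poll-until-timeout idea, assuming nothing about the 
real MetaClientTestUtil.verify() signature: `PollingVerifier`, its `verify` 
method, the 50 ms poll interval, and the key and value used in `main` are all 
hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper; the actual MetaClientTestUtil.verify() may differ.
public class PollingVerifier {
  private static final long POLL_INTERVAL_MS = 50; // assumed interval
  private static final String DATA_VALUE = "testData"; // assumed value

  // Re-evaluates the condition until it holds or the timeout elapses.
  public static boolean verify(BooleanSupplier condition, long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(POLL_INTERVAL_MS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> cache = new ConcurrentHashMap<>();
    ScheduledExecutorService watcher = Executors.newSingleThreadScheduledExecutor();

    // Simulate a watch that populates the cache some time after the create call.
    watcher.schedule(() -> { cache.put("/key/data", DATA_VALUE); }, 200, TimeUnit.MILLISECONDS);

    // Compare against the known expected value, not a value read earlier from the cache.
    boolean updated = verify(() -> DATA_VALUE.equals(cache.get("/key/data")), 5000);
    System.out.println("cache updated: " + updated);

    watcher.shutdown();
  }
}
```

   Comparing against the constant rather than a previously read value means a 
null read can never satisfy the check.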
   
   ---
   I was able to **inconsistently** reproduce this failure by making 
testCacheDataUpdates run last, setting its priority = 1 (the default is 0):
   ```
       @Test (priority = 1)
       public void testCacheDataUpdates() {
   ```
   My assumption is that the failure is more likely when the time between the 
create request being sent and the watch being triggered increases. The 
testLargeClusterLoading method sends 1600 create requests to the ZK server, 
likely putting it under some load. If testCacheDataUpdates runs afterwards, 
the ZK server may still be under load, so the failure likelihood is increased. 
   
   If anyone is able to consistently reproduce this, then that would be very 
helpful. 
   
   ### Tests
   
   - [ ] The following tests are written for this issue:
   
   testCacheDataUpdates
   
   - The following is the result of the "mvn test" command on the appropriate 
module:
   
   ```
   $  mvn test -Dtest=TestZkMetaClientCache -pl=meta-client
   
   [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
11.085 s - in org.apache.helix.metaclient.impl.zk.TestZkMetaClientCache
   [INFO] 
   [INFO] Results:
   [INFO] 
   [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0
   [INFO] 
   [INFO] 
   [INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ meta-client 
---
   [INFO] Loading execution data file 
/Users/gspencer/Desktop/git-repos/helix/meta-client/target/jacoco.exec
   [INFO] Analyzed bundle 'Apache Helix :: Meta Client' with 78 classes
   [INFO] 
------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] 
------------------------------------------------------------------------
   [INFO] Total time:  14.122 s
   [INFO] Finished at: 2023-11-22T12:16:19-08:00
   [INFO] 
------------------------------------------------------------------------
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

