GrantPSpencer opened a new pull request, #2705:
URL: https://github.com/apache/helix/pull/2705
### Issues
- [ ] My PR addresses the following Helix issues and references them in the
PR description:
#2693 [Failed CI Test] testCacheDataUpdates
### Description
- [ ] Here are some details about my PR, including screenshots of any UI
changes:
Metaclient cache utilizes ZK watches to populate its data, so there can be a
lag between when an operation occurs and when that operation is reflected in
the cache. testCacheDataUpdates was creating a node with
`zkMetaClientCache.create(key + DATA_PATH, DATA_VALUE)` and then immediately
retrieving it with `zkMetaClientCache.get(key + DATA_PATH)`. This get() call
would actually return null (so data = null), and the subsequent assertion
`Assert.assertEquals(data, zkMetaClientCache.getDataCacheMap().get(key +
DATA_PATH))` would pass because the value had not been populated in the
DataCacheMap either, so it evaluated to `assertEquals(null, null)`.
The next assertion would then fail because it compared the stale `data` value
of null to the value in the cache. If the cache had been updated by then, this
assertion would fail. If the cache had not been updated, the assertion would
pass, which explains the flakiness.
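The race described above can be illustrated with a small simulation. This is only a sketch: `StaleReadDemo`, its map, and the latch-controlled "watch" thread are hypothetical stand-ins and do not use the Helix or ZooKeeper APIs. The latch makes the timing deterministic, whereas in the real test the delay depends on when the ZK watch fires.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical simulation of a cache that is populated asynchronously.
class StaleReadDemo {
    // Returns {firstCheckPasses, secondCheckPasses}.
    static boolean[] simulate() {
        Map<String, String> cache = new ConcurrentHashMap<>();
        CountDownLatch watchMayFire = new CountDownLatch(1);

        // Stand-in for the watch callback that fills the cache later.
        Thread watcher = new Thread(() -> {
            try {
                watchMayFire.await();
                cache.put("/key/data", "value");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        watcher.start();

        // Read immediately after "create": the cache is still empty.
        String data = cache.get("/key/data");
        // Compares null with null, so the check passes without verifying anything.
        boolean firstPasses = Objects.equals(data, cache.get("/key/data"));

        // Let the simulated watch fire and wait for it to finish.
        watchMayFire.countDown();
        try {
            watcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // The stale null is now compared with the populated value.
        boolean secondPasses = Objects.equals(data, cache.get("/key/data"));
        return new boolean[] {firstPasses, secondPasses};
    }

    public static void main(String[] args) {
        boolean[] result = simulate();
        System.out.println("first check passes: " + result[0]);
        System.out.println("second check passes: " + result[1]);
    }
}
```

In this simulation the first check passes vacuously and the second fails once the cache is populated, which mirrors the failing runs of the test.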
The first assertion has been changed to also use the
MetaClientTestUtil.verify() method which will repeatedly check until timeout,
giving time for the cache to successfully update.
Both assertions have been changed to expect DATA_VALUE as the znode value,
to prevent checking against a possibly stale value.
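As a rough illustration of the "check repeatedly until timeout" idea, here is a minimal sketch. The `pollUntil` helper, its parameters, and the delayed writer thread are hypothetical and written only for this example; the actual signature of MetaClientTestUtil.verify() is not shown in this PR description and may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; not the MetaClientTestUtil implementation.
class PollingVerifyDemo {
    // Re-evaluates the condition every intervalMs until it holds or timeoutMs elapses.
    static boolean pollUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        final String expected = "value"; // fixed expected value, not a value read back from the cache

        // Simulated delayed cache population (stand-in for a watch callback).
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            cache.put("/key/data", expected);
        }).start();

        boolean updated = pollUntil(() -> expected.equals(cache.get("/key/data")), 2000, 10);
        System.out.println("cache updated in time: " + updated);
    }
}
```

Comparing against a fixed expected value rather than a value read earlier from the cache means the check cannot pass on a stale null.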
---
I was able to **inconsistently** reproduce this failure by making
testCacheDataUpdates run last, setting its priority = 1 (default is 0):
```
@Test (priority = 1)
public void testCacheDataUpdates() {
```
My assumption is that the failure is more likely to occur when the time from
the create request being sent to the watch being triggered is increased. The
testLargeClusterLoading method sends 1600 create requests to the ZK server,
likely putting it under some load. If testCacheDataUpdates occurs afterwards,
then maybe the ZK server is still under load and so the likelihood of failure is increased.
If anyone is able to consistently reproduce this, then that would be very
helpful.
### Tests
- [ ] The following tests are written for this issue:
testCacheDataUpdates
- The following is the result of the "mvn test" command on the appropriate
module:
```
$ mvn test -Dtest=TestZkMetaClientCache -pl=meta-client
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.085 s - in org.apache.helix.metaclient.impl.zk.TestZkMetaClientCache
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO]
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ meta-client ---
[INFO] Loading execution data file /Users/gspencer/Desktop/git-repos/helix/meta-client/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Meta Client' with 78 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14.122 s
[INFO] Finished at: 2023-11-22T12:16:19-08:00
[INFO] ------------------------------------------------------------------------
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]