Anshul-creator opened a new pull request, #20645:
URL: https://github.com/apache/kafka/pull/20645
# [KAFKA-18981] Deflake `testMinIsrUpdateWithElr` by maintaining broker
heartbeats during waits
### What
This PR removes nondeterminism in `testMinIsrUpdateWithElr` by ensuring
broker sessions do not expire while the test is waiting/sleeping, through a
lightweight background heartbeat pumper.
### Reproduction of flakiness
The test is stable until you insert a sleep equal to the controller session
timeout immediately after topic creation. With the original code:
1. In `testMinIsrUpdateWithElr`, add the line **right after**:
```java
CreateTopicsResponseData createTopicsResponseData = active.createTopics(
ANONYMOUS_CONTEXT, createTopicsRequestData, Set.of("foo",
"bar")).get();
```
Insert:
```java
Thread.sleep(sessionTimeoutMillis);
```
2. Run the test loop a few times:
```bash
pass=0; fail=0; for i in {1..5}; do \
echo "=== Run $i ==="; \
./gradlew :metadata:test \
--tests 'org.apache.kafka.controller.QuorumControllerTest' \
-Pkafka.test.run.flaky=true \
-PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
--rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) ||
fail=$((fail+1)); \
done; echo "Summary: $pass pass / $fail fail"
```
### Why this flakes
- The controller fences brokers that miss heartbeats for
`sessionTimeoutMillis`.
- After initially unfencing all brokers, the test performs a sleep/wait
before beginning the fencing phase.
- That pause allows the survivor broker (id `1`) to lapse its session and
get fenced transiently.
- Subsequent ISR/ELR assertions sometimes observe this unintended state,
causing nondeterministic failures.
### How (the fix)
Add a minimal **background heartbeat pumper** inside the test:
- Schedule rate: `periodMs = max(50ms, sessionTimeoutMillis / 3)` to
guarantee multiple heartbeats per timeout window.
- Phase 1 (stabilize): heartbeat **all** brokers while creating topics and
applying the initial broker-level config.
- Phase 2 (fencing): set a flag to heartbeat **only**
`brokersToKeepUnfenced` right before `waitForCondition` that fences brokers 2
and 3.
This preserves the test’s intent (only broker `1` remains unfenced; brokers
`2,3` fence) while eliminating timing races introduced by sleeps/awaits.
### Verification of stability
With the heartbeat pumper in place **and the flakiness sleep enabled**, the
loop below completed without failures:
```bash
pass=0; fail=0; for i in {1..50}; do \
echo "=== Run $i ==="; \
./gradlew :metadata:test \
--tests 'org.apache.kafka.controller.QuorumControllerTest' \
-Pkafka.test.run.flaky=true \
-PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
--rerun-tasks --no-daemon --no-build-cache && pass=$((pass+1)) ||
fail=$((fail+1)); \
done; echo "Summary: $pass pass / $fail fail"
```
**Note:** I ran the loop 50 times (compared to the 5 times in reproducing
the flakiness case) to increase trust in the fix
### Scope
- **Test-only change.**
- No production behavior/APIs altered.
- Keeps existing assertions and ISR/ELR expectations intact.
### JIRA
[KAFKA-18981](https://issues.apache.org/jira/browse/KAFKA-18981)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]