merlimat opened a new pull request, #25675: URL: https://github.com/apache/pulsar/pull/25675
### Motivation

`SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` waits for the failover state machine to converge across three phases (failover 0→2, recover 2→1, recover 1→0). Each phase uses an `Awaitility.untilAsserted(...)` lambda that combines three assertions:

1. The per-index `pulsarServiceStateArray` matches the expected states.
2. `producer.send(...)` succeeds.
3. `failover.getCurrentPulsarServiceIndex()` returns the expected index.

When the failover state has converged but the producer's underlying connection is still being re-established (`updateServiceUrl` calls `cnxPool.closeAllConnections()`), the `producer.send(...)` call inside the lambda can stall for up to the producer's send timeout (~30s). Each retry of the lambda then burns ~30s of the per-phase budget, even though the failover state machine itself has already settled. On slow CI agents this causes the per-phase 120s budget to time out at phase 3 with `expected [true] but found [false]`.

Example failure: https://scans.gradle.com/s/xiv7nu4ujnh5c/tests/task/:pulsar-broker:test/details/org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest/testAutoClusterFailover%5B4%5D(false)/1/output

### Modifications

Split the convergence check from the side checks per phase:

- Wait inside `Awaitility.untilAsserted(...)` only for the per-index state and `currentPulsarServiceIndex` (cheap reads on the failover executor).
- Move `producer.send(...)` outside the await loop so it runs once per phase and surfaces send failures directly.

Also extracted small helpers (`awaitStatesAndIndex`, `assertStatesEqual`) to remove the repetitive submit-future-join boilerplate, and bumped the per-phase budget to 180s with an overall 12-minute timeout (the probe timeout is 3s and a single failed probe during recovery resets `recoverThreshold`, so a phase can need up to ~30s of healthy probes to recover).

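
For illustration, the split described above might look roughly like the sketch below. It is not the code in this PR: a plain polling loop stands in for Awaitility so the example has no dependencies, and the names `PhaseAwaitSketch`, `awaitUntil`, and `runPhase` are hypothetical. The point it shows is that only cheap state reads are retried, while the potentially slow send runs exactly once after convergence.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; not the helpers added by the PR.
final class PhaseAwaitSketch {

    /** Polls a cheap condition until it holds or the budget runs out (stand-in for Awaitility). */
    static boolean awaitUntil(Duration budget, Duration pollInterval, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.nanoTime() + budget.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    /** One phase: retry only the cheap convergence check, then run the slow side check once. */
    static void runPhase(BooleanSupplier converged, Runnable sendOnce, Duration budget)
            throws InterruptedException {
        if (!awaitUntil(budget, Duration.ofMillis(10), converged)) {
            throw new AssertionError("state did not converge within " + budget);
        }
        // Outside the retry loop: a stalled or failing send no longer consumes
        // the convergence budget, and its failure surfaces directly.
        sendOnce.run();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger polls = new AtomicInteger();
        AtomicInteger sends = new AtomicInteger();
        // Assumed toy state: "converges" on the third poll.
        runPhase(() -> polls.incrementAndGet() >= 3, sends::incrementAndGet, Duration.ofSeconds(2));
        System.out.println("polls=" + polls.get() + ", sends=" + sends.get()); // polls=3, sends=1
    }
}
```

With the send inside the retried lambda, a 30s stall per attempt would allow only about four attempts in a 120s budget; with the send hoisted out, the loop can poll the state as often as the poll interval allows.
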
### Verifying this change

This change is already covered by existing tests: `SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` (TLS and non-TLS variants). Locally I ran it 3 times in a row with fresh Gradle daemons; each run took ~17s per variant and all passed.

### Does this pull request potentially affect one of the following parts:

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
