lhotari opened a new pull request, #25977:
URL: https://github.com/apache/pulsar/pull/25977
Follow-up to #25946 (which first hardened `initializeState` for channel
disruption).
### Motivation
`ExtensibleLoadManagerImplTest.initializeState` (a `@BeforeMethod`)
intermittently fails with
`org.awaitility.core.ConditionTimeoutException: Assertion condition null
within 2 minutes.`
Root cause (confirmed from the failing run's test-report XML and thread
dump): the failing
`initializeState` runs immediately after a role-churning test such as
`testRoleChangeIdempotency`,
whose direct `playLeader()`/`playFollower()` calls can leave the channel
system topic
`persistent://pulsar/system/loadbalancer-service-unit-state` in an
*owns-but-doesn't-serve* state.
Clients looking it up get `"... not served by this
instance:localhost:<port>. Please redo the lookup."`
(logged ~120× over the 120s window) while the channel's `ServiceUnitState`
reports a *different*
owner, and the channel's internal producer falls into escalating reconnect
backoff (observed up to
61s) and never recovers.
The existing `monitor() + unload` retry cannot heal this: `monitor()` only
reconciles each broker's
*role* with the channel ownership, and the roles are already consistent in
this state — so the
unload's channel publish keeps failing for the full 120s and Awaitility
times out. (The
`Assertion condition null` wording is just Awaitility reporting that only
*ignored* / non-assertion
exceptions occurred; the thread dump confirmed no thread was actually stuck.)
### Modifications
`ExtensibleLoadManagerImplBaseTest.initializeState` now:
1. Tries a 30s fast path: drive `monitor()` and unload the test namespace
(unchanged behavior,
extracted into `awaitTestNamespaceUnloaded(...)`).
2. If that times out (the channel is wedged), calls a new
`recoverChannelOwnership()` and retries
the unload for 60s.
`recoverChannelOwnership()` forces a clean channel owner via leader
re-election: it closes the
current channel owner's `LeaderElectionService` so the other broker wins the
election; that broker's
`playLeader()` re-creates and re-serves the channel topic, and the ownership
change makes clients
redo their stale lookups. This reuses the exact LES-bounce pattern already
used by
`makePrimaryAsLeader()`. In the common (healthy) case the fast path succeeds
immediately and the
recovery never runs, so behavior is unchanged.
A now-stale comment in `ExtensibleLoadManagerImplTest` (describing the
`initializeState` backstop) is
updated to match.
### Verifying this change
- [ ] Make sure that the change passes the CI checks.
This change is already covered by existing tests
(`ExtensibleLoadManagerImplTest`, run in the flaky
group with `-PexcludedTestGroups=''`). Verified locally:
- `testRoleChangeIdempotency` + `testHandleNoChannelOwner` pass (both
`serviceUnitStateTableViewClassName` parametrizations).
- The recovery path was exercised by temporarily forcing it on *every*
`initializeState` invocation (LES bounce before each test method); all selected
tests still passed, confirming the recovery is benign in a healthy cluster.
- `spotlessJavaCheck` and `checkstyleTest` are green.
The original wedge is timing-dependent and was not reproduced locally;
confidence rests on the
artifact-confirmed root cause plus the above.
### Does this pull request potentially affect one of the following parts:
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
This change is test-only (no production code changes).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]