[PR] [fix][test] Fix flaky ExtensibleLoadManagerImplTest.initializeState by recovering wedged channel ownership [pulsar]

via GitHub Mon, 08 Jun 2026 10:30:46 -0700


lhotari opened a new pull request, #25977:
URL: https://github.com/apache/pulsar/pull/25977


   Follow-up to #25946 (which first hardened `initializeState` for channel 
disruption).
   
   ### Motivation
   
   `ExtensibleLoadManagerImplTest.initializeState` (a `@BeforeMethod`) 
intermittently fails with
   `org.awaitility.core.ConditionTimeoutException: Assertion condition null 
within 2 minutes.`
   
   Root cause (confirmed from the failing run's test-report XML and thread 
dump): the failing
   `initializeState` runs immediately after a role-churning test such as 
`testRoleChangeIdempotency`,
   whose direct `playLeader()`/`playFollower()` calls can leave the channel 
system topic
   `persistent://pulsar/system/loadbalancer-service-unit-state` in an 
*owns-but-doesn't-serve* state.
   Clients looking it up get `"... not served by this 
instance:localhost:<port>. Please redo the lookup."`
   (logged ~120× over the 120s window) while the channel's `ServiceUnitState` 
reports a *different*
   owner, and the channel's internal producer falls into escalating reconnect 
backoff (observed up to
   61s) and never recovers.
   
   The existing `monitor() + unload` retry cannot heal this: `monitor()` only 
reconciles each broker's
   *role* with the channel ownership, and the roles are already consistent in 
this state — so the
   unload's channel publish keeps failing for the full 120s and Awaitility 
times out. (The
   `Assertion condition null` wording is just Awaitility reporting that only 
*ignored* / non-assertion
   exceptions occurred; the thread dump confirmed no thread was actually stuck.)
   
   ### Modifications
   
   `ExtensibleLoadManagerImplBaseTest.initializeState` now:
   
   1. Tries a 30s fast path: drive `monitor()` and unload the test namespace 
(unchanged behavior,
      extracted into `awaitTestNamespaceUnloaded(...)`).
   2. If that times out (the channel is wedged), calls a new 
`recoverChannelOwnership()` and retries
      the unload for 60s.
   
   `recoverChannelOwnership()` forces a clean channel owner via leader 
re-election: it closes the
   current channel owner's `LeaderElectionService` so the other broker wins the 
election; that broker's
   `playLeader()` re-creates and re-serves the channel topic, and the ownership 
change makes clients
   redo their stale lookups. This reuses the exact LES-bounce pattern already 
used by
   `makePrimaryAsLeader()`. In the common (healthy) case the fast path succeeds 
immediately and the
   recovery never runs, so behavior is unchanged.
   
   A now-stale comment in `ExtensibleLoadManagerImplTest` (describing the 
`initializeState` backstop) is
   updated to match.
   
   ### Verifying this change
   
   - [ ] Make sure that the change passes the CI checks.
   
   This change is already covered by existing tests 
(`ExtensibleLoadManagerImplTest`, run in the flaky
   group with `-PexcludedTestGroups=''`). Verified locally:
   - `testRoleChangeIdempotency` + `testHandleNoChannelOwner` pass (both 
`serviceUnitStateTableViewClassName` parametrizations).
   - The recovery path was exercised by temporarily forcing it on *every* 
`initializeState` invocation (LES bounce before each test method); all selected 
tests still passed, confirming the recovery is benign in a healthy cluster.
   - `spotlessJavaCheck` and `checkstyleTest` are green.
   
   The original wedge is timing-dependent and was not reproduced locally; 
confidence rests on the
   artifact-confirmed root cause plus the above.
   
   ### Does this pull request potentially affect one of the following parts:
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment
   
   This change is test-only (no production code changes).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [fix][test] Fix flaky ExtensibleLoadManagerImplTest.initializeState by recovering wedged channel ownership [pulsar]

Reply via email to