merlimat opened a new pull request, #25976: URL: https://github.com/apache/pulsar/pull/25976

### Motivation

The `Pulsar CI Flaky` suite has been failing frequently on `ExtensibleLoadManagerImplTest`, with a 2-minute timeout in the `initializeState` `@BeforeMethod` (example run: https://github.com/apache/pulsar/actions/runs/27144684649):

```
org.awaitility.core.ConditionTimeoutException: Assertion condition null within 2 minutes.
    at ...ExtensibleLoadManagerImplBaseTest.initializeState(ExtensibleLoadManagerImplBaseTest.java)
```

**Root cause.** Tests such as `testHandleNoChannelOwner` deliberately churn leader election by closing the `LeaderElectionService` on both brokers. This can leave the channel-topic bundle `pulsar/system/0x00000000_0xffffffff` (which hosts `loadbalancer-service-unit-state`) in an *owner-recorded-but-not-actually-served* state. Every channel operation then fails with `... not served by this instance ... Please redo the lookup`.

`initializeState` (reworked in #25946) drives `monitor()` and retries the namespace unload for 120s, but `monitor()` cannot heal this particular state: `ExtensibleLoadManagerImpl.handleNoChannelOwnerError` only restarts leader election when the channel reports *"no channel owner now"*. When an owner **is** recorded but refuses to serve, no such error is thrown, recovery never triggers, and the unload, which must publish to the channel topic, can never succeed. The 120s budget is exhausted and the `@BeforeMethod` fails, cascading to skipped tests.

### Modifications

In `ExtensibleLoadManagerImplBaseTest.initializeState`, force-serve the channel topic inside the existing retry loop, before the unload:

- `admin.lookups().lookupTopic(...)` re-assigns the `pulsar/system` bundle, and
- `admin.topics().getStats(...)` forces the recorded owner to actually load it (the lookup layer alone can claim an owner that refuses to serve).
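
As a rough illustration of the ordering described above (lookup, then stats, then unload, repeated on every attempt), here is a minimal stdlib-only sketch. It is not the code in this PR: `ChannelAdmin`, `FakeAdmin`, and `unloadWithForceServe` are hypothetical stand-ins for the `admin.lookups().lookupTopic(...)`, `admin.topics().getStats(...)`, and namespace-unload calls, and the full topic name and the `public/default` namespace in `main` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Stdlib-only sketch; every name here is hypothetical, not Pulsar's test code. */
final class ForceServeRetrySketch {

    /** Hypothetical stand-in for the three admin calls the retry loop needs. */
    interface ChannelAdmin {
        void lookupTopic(String topic) throws Exception;          // re-assigns the bundle
        void getStats(String topic) throws Exception;             // forces the owner to load it
        void unloadNamespace(String namespace) throws Exception;  // must publish to the channel
    }

    /**
     * Force-serves the channel topic before every unload attempt.
     * Returns the 1-based number of the attempt that succeeded.
     */
    static int unloadWithForceServe(ChannelAdmin admin, String channelTopic,
                                    String namespace, int maxAttempts, long backoffMillis) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                admin.lookupTopic(channelTopic);
                admin.getStats(channelTopic);
                admin.unloadNamespace(namespace);
                return attempt;
            } catch (Exception e) {
                last = e; // remember the failure and retry the whole sequence
            }
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while retrying", ie);
            }
        }
        throw new IllegalStateException("unload failed after " + maxAttempts + " attempts", last);
    }

    /** Fake admin: the recorded owner refuses to serve for the first few getStats calls. */
    static final class FakeAdmin implements ChannelAdmin {
        final List<String> calls = new ArrayList<>();
        private int refusalsLeft;
        private boolean served;

        FakeAdmin(int refusals) {
            this.refusalsLeft = refusals;
        }

        @Override
        public void lookupTopic(String topic) {
            calls.add("lookup");
        }

        @Override
        public void getStats(String topic) {
            calls.add("stats");
            if (refusalsLeft > 0) {
                refusalsLeft--;
                throw new IllegalStateException("not served by this instance");
            }
            served = true;
        }

        @Override
        public void unloadNamespace(String namespace) {
            calls.add("unload");
            if (!served) {
                throw new IllegalStateException("channel topic not served");
            }
        }
    }

    public static void main(String[] args) {
        FakeAdmin admin = new FakeAdmin(2);
        int attempt = unloadWithForceServe(admin,
                "persistent://pulsar/system/loadbalancer-service-unit-state", // assumed full topic name
                "public/default",                                             // hypothetical namespace
                5, 10L);
        System.out.println("unload succeeded on attempt " + attempt + ", calls=" + admin.calls);
    }
}
```

The point of the sketch is only the placement: because the force-serve calls sit inside the loop, a channel that degrades between attempts is re-served right before the next unload instead of relying on a one-time stabilization step.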

The lookup-then-stats sequence is the same one `awaitChannelOwnerStable()` already uses to stabilize after churn, but here it runs on **every** retry attempt so the channel is re-served immediately before each unload, rather than only once in the churn test's `finally`, where the state can degrade again before the next `initializeState`. It is guarded to the `ServiceUnitStateTableViewImpl` (system-topic) variant, matching `awaitChannelOwnerStable`'s own guard; the metadata-store variant has no channel *topic* to serve.

This is a test-side mitigation. The durable fix is product-side (teaching `monitor()` / `handleNoChannelOwnerError` to detect *owner-recorded-but-unserved* and re-assign the bundle) and can follow separately.
