merlimat opened a new pull request, #25715:
URL: https://github.com/apache/pulsar/pull/25715
### Motivation
`ServiceUnitStateChannel.start()` is invoked via
`pulsar.runWhenReadyForIncomingRequests(...)`, so the broker accepts HTTP
requests before the channel reaches `Started`. During that window, any topic
lookup reaches `getOwnerAsync`, which fails immediately with:
```
java.lang.IllegalStateException: Invalid channel state:LeaderElectionServiceStarted
    at o.a.p.b.l.e.channel.ServiceUnitStateChannelImpl.getOwnerAsync(ServiceUnitStateChannelImpl.java:539)
```
The `@BeforeMethod startBroker` previously included a
`lookupPartitionedTopic` probe. After a broker restart, that probe hit the
channel-startup race: each poll iteration failed fast with the channel-state
error, so the loop spun until the 180s budget expired without giving the
channel any time to finish initialization. Under CI resource contention,
`channel.start()` (driven by `tableview.fill()` loading existing bundles) can
take 60–90s, which exceeds the Awaitility budget once a follow-on
`deferGetOwner` 30s timeout is layered on top.
Failure trace observed on CI for cluster `MultiLoadManagerTest-ee82e900-…`:
```
21:08:23 WARN Broker is not ready yet {broker=pulsar-broker-1, yet=
--- An unexpected error occurred in the server ---
Message: Invalid channel state:LeaderElectionServiceStarted
…
21:11:39 WARN Broker is not ready yet {broker=pulsar-broker-1, yet=java.util.concurrent.TimeoutException}
…
21:13:36 WARN Broker is not ready yet {broker=pulsar-broker-1, yet=java.lang.InterruptedException}
```
### Modifications
Drop the `createPartitionedTopic` + `lookupPartitionedTopic` probe from
`@BeforeMethod startBroker`. The remaining
`getActiveBrokers().size() == NUM_BROKERS` check already verifies that the
broker is reachable and sees the cluster, which is what the `@BeforeMethod`
actually needs to assert before the next test runs.
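The remaining readiness condition can be pictured as a bounded poll on the active-broker count. The sketch below is a plain-JDK stand-in, not the test's actual code: `BrokerReadinessSketch`, `awaitActiveBrokers`, and the `Supplier` standing in for the `getActiveBrokers()` call are hypothetical names, and the timeout and poll interval are illustrative values only.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: polls an "active brokers" source until the expected
// count is seen or the time budget runs out. Not taken from the Pulsar tests.
class BrokerReadinessSketch {

    static boolean awaitActiveBrokers(Supplier<List<String>> activeBrokers,
                                      int expected,
                                      Duration timeout,
                                      Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                if (activeBrokers.get().size() == expected) {
                    return true;
                }
            } catch (RuntimeException e) {
                // A fail-fast error is treated as "not ready yet" and retried.
            }
            // Overflow-safe deadline comparison for System.nanoTime().
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Fake source (assumption): the second broker appears on the third poll.
        AtomicInteger polls = new AtomicInteger();
        Supplier<List<String>> fake = () -> polls.incrementAndGet() < 3
                ? List.of("broker-0")
                : List.of("broker-0", "broker-1");
        boolean ready = awaitActiveBrokers(fake, 2,
                Duration.ofSeconds(5), Duration.ofMillis(50));
        System.out.println("ready=" + ready + " after " + polls.get() + " polls");
    }
}
```

In this sketch the condition depends only on cluster membership, so it does not go through topic lookup and is not exposed to the `getOwnerAsync` channel-state error described in the Motivation section.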
The underlying broker-side race (`getOwnerAsync` failing immediately rather
than waiting for `Started`) is a separate broker-level issue and is
intentionally out of scope here.
### Verifying this change
This change is a trivial test fix; the existing tests cover the load-manager
behavior.
### Does this pull request potentially affect one of the following parts:
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment