merlimat opened a new pull request, #26009: URL: https://github.com/apache/pulsar/pull/26009
Fixes #25997 ### Motivation #25929 added an `explicitHostURI` pin to `ConnectionHandler` so the v5 transaction coordinator's metadata-store discovery re-dials the elected leader on retry instead of falling back to the service URL. The same change also persisted the broker-assigned redirect URL from `CommandCloseProducer`/`CommandCloseConsumer` unload notifications (PIP-307) into that pin: a producer/consumer that was redirected once stayed pinned to that address on every subsequent retry, never going through topic lookup again, and nothing ever cleared the field. A redirect to a wrong or stale broker — e.g. a stepping-down `ExtensibleLoadManagerImpl` leader whose `closeInternalTopics()` redirects internal-topic clients to itself via the stale channel-table assignment — now wedged the client permanently with `ServiceUnitNotReadyException` instead of recovering via lookup on the next retry. This is the cause of the `ExtensibleLoadManagerImplTest.testRoleChange` flakiness on master. Before #25929 the redirect was honored for the immediate reconnect attempt only, so one failed attempt was followed by a lookup that found the new owner. ### Modifications - `ConnectionHandler.connectionClosed()` updates the `explicitHostURI` pin only for handlers already in explicit-host mode, i.e. the TC metadata-store discovery, which entered it via `grabCnx(URI, boolean)`. The pin update is still needed there: `TransactionMetaStoreHandler.retargetLeader()` switches leaders by closing the channel and passing the new leader URI through `connectionClosed()`, and a failed first dial to the new leader must not retry against the old one. - Lookup-based handlers (producers/consumers) never enter explicit-host mode, so the broker-assigned redirect is honored for the immediate reconnect attempt only; if that attempt fails, retries fall back to topic lookup, restoring the pre-#25929 behavior. - Added a regression test that delivers a disconnect+redirect with an unreachable `assignedBrokerServiceUrl` to a live producer and consumer (exactly as `ClientCnx#handleCloseProducer`/`#handleCloseConsumer` do) and asserts both recover via lookup and can exchange a message. ### Verifying this change - [x] Make sure that the change passes the CI checks. This change added tests and can be verified as follows: - `BrokerClientIntegrationTest.testStaleBrokerRedirectFallsBackToLookup` fails on master without the fix (the clients stay pinned to the unreachable address) and passes with it. - `TransactionClientConnectTest` passes, covering the TC client connection path. - `ExtensibleLoadManagerImplTest.testRoleChange` with `invocationCount = 20` passes 40/40 invocations with the fix (the test factory runs both `ServiceUnitStateChannel` implementations), vs. 1-2 failures per 20 on master per #25997. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
