lhotari opened a new issue, #25999: URL: https://github.com/apache/pulsar/issues/25999
### Search before reporting - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar. ### Read release policy - [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker. ### User environment - master branch; the underlying behavior is long-standing (`LeaderElectionImpl` has read the leader value through a `MetadataCache` since the metadata-store refactor), so supported release branches are affected as well - Broker-side issue; no specific client involved ### Issue Description **Scenario.** When the leader is lost — the elected broker shuts down, crashes, loses its metadata session, or hands off leadership — the leader's ephemeral node is deleted and the remaining participants re-elect. While that re-election is settling, reading the current leader gives wrong answers on participants that did not just run their own `elect()`: - `LeaderElectionService.getCurrentLeader()` → `LeaderElection.getLeaderValueIfPresent()` returns `Optional.empty()` even though a new leader is being (or has been) elected. The cache entry backing it is invalidated by the `Deleted` notification and is only repopulated by the next loading read — in the worst case by the periodic refresh task, up to 5 minutes later. - `LeaderElectionService.readCurrentLeader()` → `LeaderElection.getLeaderValue()` is no better as an authoritative read: it is just `cache.get(path)`, so it reflects whatever the cache/store has at that instant, not the settled election outcome. Mid-handoff it can return empty or the pre-handoff leader. In short, there is currently **no authoritative way to ask "who is the leader"** — both reads are snapshots of a cache whose lifecycle is disconnected from the election state machine. **Use cases where the authoritative leader is required**, and what goes wrong today when the leader is lost: - `BrokersBase.getLeaderBroker` (REST `GET /brokers/leaderBroker`): returns 404 "Couldn't find leader broker" although a leader exists or the election settles a moment later. - `NamespacesBase.validateLeaderBrokerAsync`: fails the request with `412 PRECONDITION_FAILED ("The current leader is empty.")` instead of redirecting to the actual leader (it also calls `getCurrentLeader()` twice). - `NamespaceService.searchForCandidateBroker` (modular load manager path): an empty leader makes `leaderBrokerActive == false`, so each broker falls back to making the load-manager decision locally — several brokers can do this concurrently during the same handoff window. - `ServiceUnitStateChannelImpl` (extensible load manager): the channel owner *is* the elected leader; `getChannelOwnerAsync()`/`isChannelOwner()` go through `readCurrentLeader()`. During a handoff the unsettled answer surfaces as "There is no channel owner now" errors, triggering recovery paths (leader election restarts) that add churn instead of simply waiting for the election to complete. **Root cause.** `LeaderElectionImpl` keeps the leader value in a `MetadataCache` (`expireAfterWrite = -1`, plus a 5-minute periodic read), updated by store notifications rather than by the election cycle. The cache is most likely unnecessary: the leader can only change through an election cycle (elect / observe existing leader / leader-node deletion → re-elect), so the election state machine itself always knows the leader value at the moments it changes. The cache historically also served to register the metadata watch ("do a `get()` in order to force a notification later"), which is obsolete now that `ZKMetadataStore` uses a persistent recursive watch (`AddWatchMode.PERSISTENT_RECURSIVE`). ### Proposed fix Refactor `LeaderElectionImpl` to keep the current-leader value as part of the leader election cycle, and give the two reads distinct, documented semantics: - **`getLeaderValue()` / `readCurrentLeader()` — authoritative:** if an election is in progress (leader unknown), the returned future completes once the election settles, with the newly determined leader. To guard against an election that never completes, the wait times out after the default metadata operation timeout (30s), completing exceptionally with a `TimeoutException`. - **`getLeaderValueIfPresent()` / `getCurrentLeader()` — snapshot:** non-blocking; may be empty while a re-election is settling; only suitable for best-effort uses such as logging. - The `MetadataCache` and the periodic refresh task are removed; the leader value is set when the election settles (`Leading` with the proposed value, `Following` with the observed existing value) and cleared to "unknown" when the leader node is deleted. Call sites then need converting to the authoritative read where a decision is made: - `BrokersBase.getLeaderBroker` and `NamespacesBase.validateLeaderBrokerAsync` are already in async contexts — straightforward conversions to `readCurrentLeader()`. - `NamespaceService.searchForCandidateBroker` is synchronous today (blocking `.get()` calls inside), so it would have to be converted to fully async before it can use `readCurrentLeader()`. - `PulsarService`'s election-state listener only logs the leader and can stay on the snapshot. A prototype of this refactoring passes the existing leader-election and coordination test suites (`LeaderElectionTest` 35/35). Two integration points need explicit care in the final change: - Reads on a **closed** election should report "no leader" (empty) rather than fail or wait, because the extensible load manager's no-channel-owner recovery (`handleNoChannelOwnerError`) keys off the resulting "There is no channel owner now" condition to restart the election. - Synchronous wrappers such as the BookKeeper auditor's `PulsarLedgerAuditorManager.getCurrentAuditor()` (which `join()`s the read) will now block for up to the timeout while an election is unsettled; callers and tests polling them need to account for that. ### Reproducing the issue Kill or close the current leader broker (or close and restart its `LeaderElectionService`, as `ExtensibleLoadManagerImplTest` does) and concurrently call `GET /brokers/leaderBroker` or any admin operation that goes through `validateLeaderBrokerAsync`: during the re-election window the calls fail with 404/412 although a leader is elected moments later. The same window is observable directly at the API level: after the leader node is deleted and recreated, `getLeaderValueIfPresent()` on a non-electing participant stays empty until the next loading read (up to 5 minutes via the periodic refresh). Found while investigating flaky `ExtensibleLoadManagerImplTest.testRoleChange` runs (the client-side part of that investigation is tracked separately in #25997). ### Are you willing to submit a PR? - [x] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
