lhotari opened a new pull request, #25946: URL: https://github.com/apache/pulsar/pull/25946
Fixes #24357 ### Motivation `ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer` hung for the full 300s TestNG suite-default timeout in both attempts of CI run [27011985664](https://github.com/apache/pulsar/actions/runs/27011985664/job/79723694452), then cascaded into a later test's `@BeforeMethod initializeState()` failing with HTTP 500 (`The producer ... can not send message to the topic persistent://pulsar/system/loadbalancer-service-unit-state within given timeout`), skipping `testHandleNoChannelOwner`. Investigation (details in [this comment on #24357](https://github.com/apache/pulsar/issues/24357#issuecomment-4634015476)) found two production bugs in `ServiceUnitStateTableViewSyncer` that make `waitUntilSynced()` spin forever on entrySet-size divergence: 1. **Tombstone asymmetry** — a deletion is delivered to the syncer's put-based tail listeners as a `null` value. `ServiceUnitStateMetadataStoreTableViewImpl.put()` is `@NonNull` on the value, so the resulting NPE is silently swallowed by the table-view dispatchers and the deletion never propagates to the metadata-store view. (The reverse direction works only by accident: the system-topic view's `delete(key)` *is* `put(key, null)`.) 2. **Existing-vs-tail listener gap** — channel updates that land between the existing-items copy and the registration of the tail listeners in `syncTailItems()` are replayed to the freshly started views as *existing* items, which are wired to a no-op listener, so they never propagate. This is near-deterministic in the `MetadataStoreToSystemTopicSyncer` direction because closing the previous reader triggers a re-assignment of the channel-topic bundle itself into that gap (reproduced locally with the same sizes-off-by-one signature as CI, e.g. `MetadataStoreTableView.size: 44, SystemTopicTableView.size: 43`). Additionally, the leader-election-churning tests (`testRoleChange`, `testRoleChangeIdempotency`, `testHandleNoChannelOwner`) can leave the `pulsar/system/0x00000000_0xffffffff` bundle unserved (`Namespace bundle ... not served by this instance`), so the next test's namespace unload cannot publish its `Releasing` event and fails with HTTP 500 or hangs server-side. Nothing heals that state until the background monitor task (120s interval) reconciles the brokers' roles with the channel ownership — passive waiting/retrying alone cannot recover (reproduced locally and in personal CI). ### Modifications Production — `ServiceUnitStateTableViewSyncer` (a migration feature; no interface, call-site, or config changes): - Route `null` tail items to `delete()` instead of `put()`, and log failed sync futures at WARN (previously silently discarded by the dispatchers). - Make `waitUntilSynced()` reconcile periodically while diverged: copy source-only items to the destination (direction-aware via `dst.isMetadataStoreBased()`; raw metadata-store writes to avoid items being conflicted out, matching `syncExistingItemsToMetadataStore`) and remove destination-only items that predate the sync phase and are re-confirmed absent from the source (so a concurrent fresh write to the destination is never discarded). Batched at `MAX_CONCURRENT_SYNC_COUNT`; failures are logged and retried on the next pass; `InterruptedException` propagates. - Make the sync-wait budget an overridable field with a `@VisibleForTesting` setter (default unchanged at 300s) and add throttled DEBUG progress logging. - Code cleanup: shared `NOOP_CONSUMER`, `maybeWaitCompletion()` and `writeToMetadataStore()` helpers. Tests: - `testLoadBalancerServiceUnitTableViewSyncer`: cap at 240s (below the 300s suite default so a hang fails fast and is retried instead of consuming the whole slot), shrink the sync-wait budget to 30s before the first `start()`, drive `monitor()` inside the activation await, and tear the syncer down in a `finally` block so a failure cannot poison later tests. - `awaitChannelOwnerStable()` helper: after the leader-election-churning tests, drive `monitor()` to reconcile roles with channel ownership (the eager equivalent of the 120s background monitor task) and probe that the channel topic is looked up and served before yielding to the next test. `testHandleNoChannelOwner` additionally forces a deterministic leader in its cleanup, since a body failure mid-churn can otherwise leave leadership flapping between the brokers. Method timeouts for the churning tests raised from 30s to 60s. - `initializeState()`: drive `monitor()` and retry a bounded `unloadAsync` (15s per attempt) for up to 120s before failing loudly, healing a transiently unserved channel-topic bundle instead of failing the whole suite. ### Verifying this change - [x] Make sure that the change passes the CI checks. This change is already covered by existing tests, such as `ExtensibleLoadManagerImplTest` (both `@Factory` parametrizations exercise both syncer directions in `testLoadBalancerServiceUnitTableViewSyncer`). Verification performed: - Before the fix, the full class failed deterministically in the `MetadataStoreToSystemTopicSyncer` direction locally (twice, same signature as both CI attempts). After the fix, the syncer test completes in seconds and repeated full-class runs pass (102 tests, 0 failures), including runs where the residual pre-existing first-attempt flakes in `testRoleChange`/`testHandleNoChannelOwner` fired and were contained by the retry without cascading. - Personal CI on the fork: `Pulsar CI Flaky` (BROKER_FLAKY group) passed twice consecutively (lhotari/pulsar#225); the only full-pipeline failure was an unrelated `GeoReplicationTest` integration flake. ### Does this pull request potentially affect one of the following parts: *If the box was checked, please highlight the changes* - [ ] Dependencies (add or upgrade a dependency) - [ ] The public API - [ ] The schema - [ ] The default values of configurations - [ ] The threading model - [ ] The binary protocol - [ ] The REST endpoints - [ ] The admin CLI options - [ ] The metrics - [ ] Anything that affects deployment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
