lhotari commented on issue #25678: URL: https://github.com/apache/pulsar/issues/25678#issuecomment-4383473869
# Root cause: V5ProducerAccessModeTest.testExclusiveRejectsSecondProducer **Issue:** apache/pulsar#25678 **Observed in:** Pulsar CI run 25387350198, attempt 1, job `CI - Unit - Brokers - Broker Group 2` **Build scan:** https://gradle.com/s/jnnlhs2ddi5h4 **PR context:** #25675 (fork PR; the stated subject was an unrelated test fix — this flake is independent) ## Symptom The test asserts that a second `EXCLUSIVE` `v5Client.newProducer(...).create()` on the same scalable topic must throw `PulsarClientException` (the V5 builder is documented to "eagerly claim every active segment up front for exclusive access modes"). In the flaking run, the second `create()` returns a `Producer` instead of throwing, so the test trips on `fail("a second EXCLUSIVE producer should not be able to attach")` at `V5ProducerAccessModeTest.java:92`. TestNG retries once; the retry behaves identically. Both attempts complete in well under 60 ms, which is far too fast for any normal broker-side fence to round-trip. ## What the broker logs show Broker logs from the failing job (extracted from the `Unit-BROKER_GROUP_2-test-reports` artifact) show a deterministic sequence on the **single shared `ClientCnx`** between the test's `v5Client` and the broker: 1. **First producer admitted normally.** - `16:26:37.710` client sends `CommandProducer` for `producerId=0` on the segment topic `segment://…/scalable-089dd2ce/0000-ffff-0`. - `16:26:37.731` broker `Updated topic epoch to {newEpoch=0}`. - `16:26:37.732` broker `Created new producer {producerId=0, producerName=test-cluster-0-48}`. - `16:26:37.733` client `Created producer producerId=0`. The first `create()` resolves to the test code. 2. **Second V5 producer's lookup runs on the same connection.** - `16:26:37.736` broker pushes the initial `ScalableTopicUpdate` for `sessionId=54` (the second V5 producer's `DagWatchClient` session) — `activeSegmentCount=1`. - `16:26:37.736` client logs `Layout updated` / `Layout applied {activeSegments=[0], epoch=0}` for the second producer. 3. **The first producer is closed by the *client* before the second producer's `CommandProducer` is sent.** - `16:26:37.738` broker `Closing producer on cnx. producerId=0, producerName=test-cluster-0-48` — this log comes from `ServerCnx.handleCloseProducer` (`ServerCnx.java:2821`), which only fires when the broker receives a `CommandCloseProducer` from the client. - `16:26:37.738` broker `Closed producer on cnx. producerId=0`. - `16:26:37.739` client `Closed Producer producerId=0`. - When `producer.close(true)` resolves on the broker, the topic's `producers` map drops `producerId=0` and `hasExclusiveProducer` is reset to `false` (`AbstractTopic.java`, the exclusive-producer release path). 4. **Second producer is then admitted as a *fresh* exclusive producer.** - `16:26:37.740` client sends `CommandProducer` for `producerId=1` on the same segment topic. - `16:26:37.747` broker `Updated topic epoch to {newEpoch=1}` — i.e. `handleTopicEpochForExclusiveProducer` ran, which only happens when admission *succeeds* via the `case Exclusive` branch in `AbstractTopic.incrementTopicEpochIfNeeded` (`AbstractTopic.java:844`-`852`). That branch's gate is `!hasExclusiveProducer && producers.isEmpty()`, so it is consistent with the previous step having already removed producer 0. - `16:26:37.747` broker `Created new producer producerId=1`. - `16:26:37.748` client `Created producer producerId=1`, `topicEpoch=Optional[1]`. The second `create()` resolves successfully — the test then trips `fail(...)`. There is **no `ProducerBusyException` or `ProducerFencedException`** anywhere in the test's broker output for this case. The broker followed the *normal* admission path for the second producer; it didn't reject and then fence the first. The retry attempt (separate namespace `test-3d723c4c`) reproduces the exact same sequence with the next pair of `producerId`s (`2` then `3`), `topicEpoch` again going `0 → 1`, and the same `Closing producer on cnx` log between the second lookup and the second `CommandProducer`. The pattern is consistent, not a one-off. ## Where the race lives The two V5 producers in the test share a single `PulsarClient` (`v5Client`, set up in `V5ClientBaseTest.setupSharedV5Client()`), which means they share a single `ClientCnx`. The race surfaces only on this shared connection: the first segment producer's lifecycle is being torn down by the V5 client right around the moment the second V5 producer's session lookup completes, leaving the broker briefly in a state where there is *no* exclusive holder. The second producer's `CommandProducer` then arrives during that window and is admitted as if no prior holder ever existed. Because the first producer's tear-down is initiated by the *client* (the broker only logs `handleCloseProducer`, which is exclusively a response to a client-sent `CommandCloseProducer`), the proximate cause is something on the V5 client's side that closes the first producer's underlying v4 segment producer when the second `EXCLUSIVE` `create()` runs against the same connection. The `case Exclusive` broker branch is then a passive participant — it correctly admits the second producer because, by the time the broker sees the second `CommandProducer`, the first one is already gone. ## What is *not* yet root-caused I did not pin down the exact V5 client code path that emits `CommandCloseProducer` for the first producer. The obvious candidates were ruled out from a static reading: - `ProducerBuilderV5.createAsync` only calls `producer.closeAsync()` from its `exceptionallyCompose`, i.e. when `eagerAttachInitialAsync` *fails*. In the flaky run the second `eagerAttach` succeeds, so this path is not entered. - `ScalableTopicProducer.applyLayout` only closes per-segment producers whose segment id is no longer in the new layout. The layout reported by the broker has `activeSegments=[0]` throughout, so this does not close anything on either V5 producer. - Each `ScalableTopicProducer` has its own private `segmentProducers` map; the second V5 producer's `getOrCreateSegmentProducer(0)` therefore does not reach into the first producer's map. The fact that the `CommandCloseProducer` for the first segment producer arrives on the wire ~2 ms before the second producer's `CommandProducer` (and consistently in both retries) suggests the V5 client *is* deliberately closing the prior segment producer as part of the second `EXCLUSIVE` `create()` flow on the shared connection — it just isn't via any of the call sites listed above. The remaining places worth tracing are anything reachable from `DagWatchClient.start()` / `onUpdate` on the shared `ClientCnx` (registering `sessionId=54` while `sessionId=53` is alive on the same cnx) and any `v4Client`-level state shared between segment producers for the same `segment://…` topic. This deserves a separate investigation: the observable effect is that on a shared `ClientCnx`, V5's `EXCLUSIVE` admission on a topic that already has an exclusive holder ends up *displacing* the prior holder rather than failing fast — which contradicts the documented contract of `EXCLUSIVE` (vs. `EXCLUSIVE_WITH_FENCING`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
