Re: [I] Flaky-test: V5ProducerAccessModeTest > testExclusiveRejectsSecondProducer [pulsar]

via GitHub Tue, 05 May 2026 15:03:16 -0700


lhotari commented on issue #25678:
URL: https://github.com/apache/pulsar/issues/25678#issuecomment-4383473869


   # Root cause: V5ProducerAccessModeTest.testExclusiveRejectsSecondProducer
   
   **Issue:** apache/pulsar#25678
   **Observed in:** Pulsar CI run 25387350198, attempt 1, job `CI - Unit - 
Brokers - Broker Group 2`
   **Build scan:** https://gradle.com/s/jnnlhs2ddi5h4
   **PR context:** #25675 (fork PR; the stated subject was an unrelated test 
fix — this flake is independent)
   
   ## Symptom
   
   The test asserts that a second `EXCLUSIVE` 
`v5Client.newProducer(...).create()` on the same scalable topic must throw 
`PulsarClientException` (the V5 builder is documented to "eagerly claim every 
active segment up front for exclusive access modes"). In the flaking run, the 
second `create()` returns a `Producer` instead of throwing, so the test trips 
on `fail("a second EXCLUSIVE producer should not be able to attach")` at 
`V5ProducerAccessModeTest.java:92`. TestNG retries once; the retry behaves 
identically. Both attempts complete in well under 60 ms, which is far too fast 
for any normal broker-side fence to round-trip.
   
   ## What the broker logs show
   
   Broker logs from the failing job (extracted from the 
`Unit-BROKER_GROUP_2-test-reports` artifact) show a deterministic sequence on 
the **single shared `ClientCnx`** between the test's `v5Client` and the broker:
   
   1. **First producer admitted normally.**
      - `16:26:37.710` client sends `CommandProducer` for `producerId=0` on the 
segment topic `segment://…/scalable-089dd2ce/0000-ffff-0`.
      - `16:26:37.731` broker `Updated topic epoch to {newEpoch=0}`.
      - `16:26:37.732` broker `Created new producer {producerId=0, 
producerName=test-cluster-0-48}`.
      - `16:26:37.733` client `Created producer producerId=0`. The first 
`create()` resolves to the test code.
   
   2. **Second V5 producer's lookup runs on the same connection.**
      - `16:26:37.736` broker pushes the initial `ScalableTopicUpdate` for 
`sessionId=54` (the second V5 producer's `DagWatchClient` session) — 
`activeSegmentCount=1`.
      - `16:26:37.736` client logs `Layout updated` / `Layout applied 
{activeSegments=[0], epoch=0}` for the second producer.
   
   3. **The first producer is closed by the *client* before the second 
producer's `CommandProducer` is sent.**
      - `16:26:37.738` broker `Closing producer on cnx. producerId=0, 
producerName=test-cluster-0-48` — this log comes from 
`ServerCnx.handleCloseProducer` (`ServerCnx.java:2821`), which only fires when 
the broker receives a `CommandCloseProducer` from the client.
      - `16:26:37.738` broker `Closed producer on cnx. producerId=0`.
      - `16:26:37.739` client `Closed Producer producerId=0`.
      - When `producer.close(true)` resolves on the broker, the topic's 
`producers` map drops `producerId=0` and `hasExclusiveProducer` is reset to 
`false` (`AbstractTopic.java`, the exclusive-producer release path).
   
   4. **Second producer is then admitted as a *fresh* exclusive producer.**
      - `16:26:37.740` client sends `CommandProducer` for `producerId=1` on the 
same segment topic.
      - `16:26:37.747` broker `Updated topic epoch to {newEpoch=1}` — i.e. 
`handleTopicEpochForExclusiveProducer` ran, which only happens when admission 
*succeeds* via the `case Exclusive` branch in 
`AbstractTopic.incrementTopicEpochIfNeeded` (`AbstractTopic.java:844`-`852`). 
That branch's gate is `!hasExclusiveProducer && producers.isEmpty()`, so it is 
consistent with the previous step having already removed producer 0.
      - `16:26:37.747` broker `Created new producer producerId=1`.
      - `16:26:37.748` client `Created producer producerId=1`, 
`topicEpoch=Optional[1]`. The second `create()` resolves successfully — the 
test then trips `fail(...)`.
   
   There is **no `ProducerBusyException` or `ProducerFencedException`** 
anywhere in the test's broker output for this case. The broker followed the 
*normal* admission path for the second producer; it didn't reject and then 
fence the first.
   
   The retry attempt (separate namespace `test-3d723c4c`) reproduces the exact 
same sequence with the next pair of `producerId`s (`2` then `3`), `topicEpoch` 
again going `0 → 1`, and the same `Closing producer on cnx` log between the 
second lookup and the second `CommandProducer`. The pattern is consistent, not 
a one-off.
   
   ## Where the race lives
   
   The two V5 producers in the test share a single `PulsarClient` (`v5Client`, 
set up in `V5ClientBaseTest.setupSharedV5Client()`), which means they share a 
single `ClientCnx`. The race surfaces only on this shared connection: the first 
segment producer's lifecycle is being torn down by the V5 client right around 
the moment the second V5 producer's session lookup completes, leaving the 
broker briefly in a state where there is *no* exclusive holder. The second 
producer's `CommandProducer` then arrives during that window and is admitted as 
if no prior holder ever existed.
   
   Because the first producer's tear-down is initiated by the *client* (the 
broker only logs `handleCloseProducer`, which is exclusively a response to a 
client-sent `CommandCloseProducer`), the proximate cause is something on the V5 
client's side that closes the first producer's underlying v4 segment producer 
when the second `EXCLUSIVE` `create()` runs against the same connection. The 
`case Exclusive` broker branch is then a passive participant — it correctly 
admits the second producer because, by the time the broker sees the second 
`CommandProducer`, the first one is already gone.
   
   ## What is *not* yet root-caused
   
   I did not pin down the exact V5 client code path that emits 
`CommandCloseProducer` for the first producer. The obvious candidates were 
ruled out from a static reading:
   
   - `ProducerBuilderV5.createAsync` only calls `producer.closeAsync()` from 
its `exceptionallyCompose`, i.e. when `eagerAttachInitialAsync` *fails*. In the 
flaky run the second `eagerAttach` succeeds, so this path is not entered.
   - `ScalableTopicProducer.applyLayout` only closes per-segment producers 
whose segment id is no longer in the new layout. The layout reported by the 
broker has `activeSegments=[0]` throughout, so this does not close anything on 
either V5 producer.
   - Each `ScalableTopicProducer` has its own private `segmentProducers` map; 
the second V5 producer's `getOrCreateSegmentProducer(0)` therefore does not 
reach into the first producer's map.
   
   The fact that the `CommandCloseProducer` for the first segment producer 
arrives on the wire ~2 ms before the second producer's `CommandProducer` (and 
consistently in both retries) suggests the V5 client *is* deliberately closing 
the prior segment producer as part of the second `EXCLUSIVE` `create()` flow on 
the shared connection — it just isn't via any of the call sites listed above. 
The remaining places worth tracing are anything reachable from 
`DagWatchClient.start()` / `onUpdate` on the shared `ClientCnx` (registering 
`sessionId=54` while `sessionId=53` is alive on the same cnx) and any 
`v4Client`-level state shared between segment producers for the same 
`segment://…` topic.
   
   This deserves a separate investigation: the observable effect is that on a 
shared `ClientCnx`, V5's `EXCLUSIVE` admission on a topic that already has an 
exclusive holder ends up *displacing* the prior holder rather than failing fast 
— which contradicts the documented contract of `EXCLUSIVE` (vs. 
`EXCLUSIVE_WITH_FENCING`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Flaky-test: V5ProducerAccessModeTest > testExclusiveRejectsSecondProducer [pulsar]

Reply via email to