void-ptr974 opened a new issue, #25954:
URL: https://github.com/apache/pulsar/issues/25954

   ### Issue Description
   
   Message deduplication recovery can be started more than once for the same 
topic before the first recovery replay completes.
   
   `MessageDeduplication` defines a `Recovering` status to prevent overlapping 
transitions, but the enable path starts dedup cursor replay without first 
changing the status to `Recovering`.
   
   Because of that, another `checkStatus()` call can still observe 
`Initialized` or `Disabled` and start a second replay for the same dedup cursor.
   
   One practical trigger path is topic load with existing topic-level policies:
   
   | Time | Flow A: existing topic-level policy during load | Flow B: normal topic load chain | Dedup status / effect |
   |------|--------------------------------------------------|----------------------------------|-----------------------|
   | T1 | `initTopicPolicy()` loads existing topic policies | | dedup status is `Initialized` or `Disabled` |
   | T2 | `onUpdate()` applies topic policies and calls `checkDeduplicationStatus()` | | first dedup recovery replay starts |
   | T3 | first replay is still running asynchronously | | status is still `Initialized` or `Disabled` |
   | T4 | | topic load continues and `BrokerService` explicitly calls `checkDeduplicationStatus()` | |
   | T5 | | second check also enters the enable path | second replay starts for the same dedup cursor |
   | T6 | first replay rebuilds producer sequence state | second replay also rebuilds producer sequence state | overlapping replay can advance shared replay/cursor state |
   
   Dedup recovery rebuilds producer sequence information from the dedup cursor. 
If overlapping replay leaves the recovered sequence state incomplete or 
inconsistent, the broker may fail to identify already-published messages as 
duplicates.
   
   The impact is that duplicate messages can be accepted after topic load or 
policy refresh even though message deduplication is enabled.
   
   There is also a retry gap after recovery failure. If enabling deduplication 
fails transiently, the status moves to `Failed`, but later checks do not retry 
enabling even when the topic policy still requires deduplication.
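
The retry gap can be stated as a small decision function. The following is only an illustrative sketch, not the broker's code: the class `DedupRetrySketch`, the method `shouldStartRecovery`, and the `requiredByPolicy` flag are hypothetical names, and the `Enabled` status is assumed to exist alongside the statuses named in this issue.

```java
// Hypothetical sketch of the decision a status check could make; not Pulsar source.
class DedupRetrySketch {
    // Status names follow this issue; Enabled is an assumption.
    enum Status { Initialized, Disabled, Recovering, Enabled, Failed }

    /** Returns true when a status check should (re)start dedup recovery. */
    static boolean shouldStartRecovery(Status current, boolean requiredByPolicy) {
        if (!requiredByPolicy) {
            return false; // policy does not ask for deduplication
        }
        switch (current) {
            case Initialized:
            case Disabled:
            case Failed:      // the retry this issue asks for
                return true;
            default:          // Recovering or Enabled: nothing to start
                return false;
        }
    }
}
```

Under this sketch, `Failed` is treated like `Initialized` and `Disabled` when the policy still requires deduplication, while `Recovering` never starts a second replay.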
   
   ### Error messages
   
   There may be no error message. This is a race in the dedup recovery state 
machine.
   
   ### Reproducing the issue
   
   A deterministic unit-level reproduction can be built by delaying the first 
dedup recovery replay and invoking `checkStatus()` again before the first 
replay completes:
   
   1. Keep `MessageDeduplication` in `Initialized` or `Disabled`.
   2. Call `checkStatus()` once and delay async cursor open or replay 
completion.
   3. Call `checkStatus()` again before the first recovery finishes.
   4. Observe that the second call enters the enable path again and starts 
another replay.
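
The four steps above can be modelled without a broker. This is a simplified stand-in, not the real `MessageDeduplication` class: `DedupRaceRepro`, `pendingReplays`, and the `Enabled` status are hypothetical, and each `CompletableFuture` represents one delayed cursor replay that the test completes by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the unguarded enable path; not Pulsar source.
class DedupRaceRepro {
    enum Status { Initialized, Disabled, Enabled }

    Status status = Status.Initialized;
    // One entry per replay that was started and has not been completed yet.
    final List<CompletableFuture<Void>> pendingReplays = new ArrayList<>();

    /** Starts a replay but leaves the status unchanged until the replay finishes. */
    void checkStatus() {
        if (status == Status.Initialized || status == Status.Disabled) {
            CompletableFuture<Void> replay = new CompletableFuture<>();
            pendingReplays.add(replay);
            // The status only changes when the delayed replay completes.
            replay.thenRun(() -> status = Status.Enabled);
        }
    }
}
```

Calling `checkStatus()` twice before completing the first future leaves two entries in `pendingReplays`, which corresponds to step 4. Once the first future completes, a further `checkStatus()` call no longer starts a replay.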
   
   A production-like trigger path is:
   
   1. Enable message deduplication.
   2. Use a topic with existing topic-level policies.
   3. Load or reload the topic.
   4. During topic load, `initTopicPolicy()` can invoke `onUpdate()`, which 
applies topic policies and calls `checkDeduplicationStatus()`.
   5. The normal topic load chain later calls `checkDeduplicationStatus()` 
again.
   6. If the first recovery replay is still in progress, both checks can start 
overlapping dedup replay.
   
   ### Expected behavior
   
   - Deduplication should enter `Recovering` before starting recovery replay.
   - Concurrent status checks should not start another recovery while replay is 
in progress.
   - If enable fails transiently, a later check should retry enabling when 
deduplication is still required by policy.
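
One way to get the first two behaviors is to claim `Recovering` with a compare-and-set before starting replay, so that only the winning caller proceeds. The sketch below is a minimal illustration under stated assumptions and is not the change proposed in #25953: `DedupGuardSketch`, `replaysStarted`, and the `replay` parameter are hypothetical, and `Enabled` is assumed as the success status.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical guard around the enable path; not Pulsar source.
class DedupGuardSketch {
    enum Status { Initialized, Disabled, Recovering, Enabled, Failed }

    final AtomicReference<Status> status = new AtomicReference<>(Status.Initialized);
    final AtomicInteger replaysStarted = new AtomicInteger();

    /** 'replay' stands in for the asynchronous dedup cursor replay. */
    CompletableFuture<Void> checkStatus(CompletableFuture<Void> replay) {
        Status seen = status.get();
        boolean startable = seen == Status.Initialized || seen == Status.Disabled;
        // Claim Recovering atomically; a caller that loses the race does nothing.
        if (!startable || !status.compareAndSet(seen, Status.Recovering)) {
            return CompletableFuture.completedFuture(null);
        }
        replaysStarted.incrementAndGet();
        // Leave Recovering only once the replay has finished, one way or the other.
        return replay.handle((ignored, error) -> {
            status.set(error == null ? Status.Enabled : Status.Failed);
            return null;
        });
    }
}
```

With this guard, a second `checkStatus()` call made while the first replay is pending observes `Recovering` and returns without starting another replay. Retrying from `Failed` is deliberately left out of this sketch.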
   
   ### Actual behavior
   
   - Enable starts replay while status remains `Initialized` or `Disabled`.
   - Another `checkStatus()` can start a second replay before the first 
completes.
   - `Failed` status does not retry enable even when deduplication should still 
be enabled.
   
   ### Additional information
   
   A fix has been proposed in #25953.
   

