poorbarcode opened a new pull request, #21946:
URL: https://github.com/apache/pulsar/pull/21946

   ### Motivation
   
   A race condition can leave an orphan replicator on the original 
owner of a topic. The orphan then prevents the new owner of the topic from 
starting a replicator, which fails with 
`org.apache.pulsar.broker.service.BrokerServiceException$NamingException 
Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already 
connected to topic`.
   
   **Scenario 1**
   - Thread-1: start/restart the producer of the replicator.
   - Thread-2: unloading bundles.
   
   **Scenario 2**
   - Thread-1: start a new replicator after `replication_clusters` is updated.
   - Thread-2: unloading bundles.
   
   This PR focuses on Scenario 1.
   
   **Steps of Scenario 1**
   
   | Time | Thread: start replicator | Thread: unload bundle |
   | --- | --- | --- |
   | 1 | Initialize cursor | |
   | 2 | Start producer | |
   | 3 | Starting the producer fails; a scheduled task is added to retry | |
   | 4 | Mark topic as `closing` | |
   | 4 | | Close clients: `replicator.disconnect` |
   | 5 | | Skip closing the producer because it is null, and set `replicator.stat --> Stopped` |
   | 6 | Retry starting the producer | |
   | 7 | Set `replicator.stat --> Starting` | |
   | 8 | Producer is created successfully; set `replicator.stat --> Started` | |
   | 9 | Trigger `readMoreEntries`; since there are no entries to read, the request stays pending | |
   | 10 | | Close cursor `pulsar.repl` |
   | 11 | | Close managed ledger |
   | 12 | An orphan replicator remains, and the next topic owner cannot start a replicator due to `Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic` | |
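
The interleaving above can be replayed on a single thread to see why a single `Stopped` state is not enough. This is only an illustrative sketch, not Pulsar code: the class `OrphanReplicatorReplay`, its fields, and its methods are hypothetical, and it assumes that the scheduled retry proceeds whenever the state is `Stopped` and that the replicator is still in `Starting` when the first attempt fails.

```java
// Hypothetical single-threaded replay of the table above; not the actual Pulsar classes.
final class OrphanReplicatorReplay {
    enum State { STOPPED, STARTING, STARTED }

    State state = State.STOPPED;
    boolean producerConnected = false; // stands in for the replicator's producer
    boolean topicClosed = false;

    /** Unload thread: there is nothing to close when no producer exists yet. */
    void disconnect() {
        if (producerConnected) {
            producerConnected = false;
        }
        state = State.STOPPED;
    }

    /** Replicator thread: the retry proceeds if the state is STOPPED (assumption). */
    void retryStartProducer() {
        if (state != State.STOPPED) {
            return;
        }
        state = State.STARTING;
        producerConnected = true; // producer creation succeeds this time
        state = State.STARTED;
    }

    /** A producer that outlives its closed topic is an orphan. */
    boolean isOrphan() {
        return topicClosed && producerConnected;
    }

    public static void main(String[] args) {
        OrphanReplicatorReplay r = new OrphanReplicatorReplay();
        r.state = State.STARTING;  // times 2-3: first start fails, retry is scheduled
        r.disconnect();            // times 4-5: producer is null, state -> STOPPED
        r.retryStartProducer();    // times 6-8: retry succeeds, state -> STARTED
        r.topicClosed = true;      // times 10-11: cursor and managed ledger are closed
        System.out.println(r.isOrphan()); // prints true
    }
}
```

Because `disconnect` and a recoverable producer failure both end in the same state, the retry cannot tell that the topic is going away.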
   
   ### Modifications
   - Split the state `Replicator.State.Stopped` into `Producer_Stopped` and 
`Closed`.
     - The producer can be restarted after it was closed due to a read-entries 
error or an ack-messages error. 
     - The Replicator cannot be started again after it was closed because the 
topic was closed or replication was disabled. 
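
The split can be pictured as a small state machine in which a producer stop is recoverable and `Closed` is terminal. The following is a minimal sketch under stated assumptions, not the code in this PR: the class `ReplicatorStateSketch`, the enum constant spellings, and the method names are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the split state; not the actual Pulsar replicator class.
final class ReplicatorStateSketch {
    enum State { PRODUCER_STOPPED, STARTING, STARTED, CLOSED }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.PRODUCER_STOPPED);

    /** A (re)start is allowed only from PRODUCER_STOPPED; CLOSED is terminal. */
    boolean tryStartProducer() {
        return state.compareAndSet(State.PRODUCER_STOPPED, State.STARTING);
    }

    /** Called when producer creation finishes; false if the state changed meanwhile. */
    boolean onProducerCreated() {
        return state.compareAndSet(State.STARTING, State.STARTED);
    }

    /** Recoverable stop, e.g. after a read-entries or ack error; never leaves CLOSED. */
    boolean onProducerStopped() {
        State prev = state.getAndUpdate(
                s -> s == State.CLOSED ? s : State.PRODUCER_STOPPED);
        return prev != State.CLOSED;
    }

    /** Terminal stop: the topic is closing or replication was disabled. */
    void close() {
        state.set(State.CLOSED);
    }

    State current() {
        return state.get();
    }

    public static void main(String[] args) {
        ReplicatorStateSketch r = new ReplicatorStateSketch();
        r.tryStartProducer();   // first attempt
        r.onProducerStopped();  // attempt failed, a retry is scheduled
        r.close();              // the unload thread closes the replicator
        System.out.println(r.tryStartProducer()); // prints false: the retry is rejected
    }
}
```

With a terminal state, a retry scheduled before the unload can no longer revive the replicator, which is the failure shown in Scenario 1.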
   
   Since the scenario is too complex to cover with an automated test, I did 
not add one. However, I reproduced Scenario 1 locally.
   <img width="1662" alt="Screenshot 2024-01-23 at 00 36 48" 
src="https://github.com/apache/pulsar/assets/25195800/0e47b199-84a5-4d5f-8b93-81ff1bdf0b6b">
   <img width="1664" alt="Screenshot 2024-01-23 at 00 35 33" 
src="https://github.com/apache/pulsar/assets/25195800/523480d5-9c75-4013-80b1-8790bda8d389">
   
   TODO: 
   - How to correctly start a new Replicator after replication has been 
enabled again.
   - How to correctly start a new Replicator when 
`topic.unfenceTopicToResume` is called after `topic.close` failed.
   
   ### Documentation
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ] `doc` <!-- Your PR contains doc changes. -->
   - [ ] `doc-required` <!-- Your PR changes impact docs and you will update 
later -->
   - [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
   - [ ] `doc-complete` <!-- Docs have been already added -->
   
   ### Matching PR in forked repository
   
   PR in forked repository: x

