poorbarcode opened a new pull request, #21946:
URL: https://github.com/apache/pulsar/pull/21946
### Motivation
A race condition can leave an orphan replicator on the original owner of a
topic. The new owner of the topic then cannot start a replicator and fails
with
`org.apache.pulsar.broker.service.BrokerServiceException$NamingException:
Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already
connected to topic`.
**Scenario 1**
- Thread-1: start/restart the producer of the replicator.
- Thread-2: unloading bundles.
**Scenario 2**
- Thread-1: start a new replicator after `replication_clusters` is updated.
- Thread-2: unloading bundles.
This PR focuses on Scenario 1.
**Steps of Scenario 1**
| time | thread `start replicator` | thread `unload bundle` |
| --- | --- | --- |
| 1 | Initialize cursor | |
| 2 | Start producer | |
| 3 | Starting the producer fails; schedule a task to retry | |
| 4 | Mark topic as `closing` | |
| 4 | | Close clients: `replicator.disconnect` |
| 5 | | Skip closing the producer because the producer is null, and set `replicator.stat --> Stopped` |
| 6 | Retry to start the producer | |
| 7 | Set `replicator.stat --> Starting` | |
| 8 | Producer is created successfully; set `replicator.stat --> Started` | |
| 9 | Trigger a `readMoreEntries`; since there are no entries to read, the request stays pending | |
| 10 | | Close cursor `pulsar.repl` |
| 11 | | Close managed ledger |
| 12 | An orphan replicator remains, and the next topic owner cannot start a replicator due to `Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic` | |
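The interleaving in the table above can be replayed deterministically with a small model. This is only a sketch under assumptions: `SingleStoppedStateSketch`, its methods, and the `producerOpen` flag are hypothetical names for illustration and are not Pulsar's actual replicator classes. It shows why a single `Stopped` state lets the scheduled retry proceed after `disconnect`.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the behavior before this PR; not Pulsar's real code.
final class SingleStoppedStateSketch {
    enum State { Stopped, Starting, Started }

    private final AtomicReference<State> state = new AtomicReference<>(State.Stopped);
    private boolean producerOpen = false;

    // Unload path (steps 4-5): the producer is null, so there is nothing to
    // close, and the state is simply set to Stopped.
    void disconnect() {
        producerOpen = false;
        state.set(State.Stopped);
    }

    // Scheduled retry (steps 6-8): Stopped does not record why the replicator
    // stopped, so the retry cannot tell that the topic is closing.
    boolean retryStartProducer() {
        if (!state.compareAndSet(State.Stopped, State.Starting)) {
            return false;
        }
        producerOpen = true; // stands in for creating the remote producer
        state.set(State.Started);
        return true;
    }

    // Step 12: the topic is closed, yet a producer is still connected.
    boolean isOrphan(boolean topicClosed) {
        return topicClosed && producerOpen;
    }

    public static void main(String[] args) {
        SingleStoppedStateSketch replicator = new SingleStoppedStateSketch();
        replicator.disconnect();                              // steps 4-5
        boolean restarted = replicator.retryStartProducer();  // steps 6-8
        boolean topicClosed = true;                           // steps 10-11
        System.out.println("restarted=" + restarted
                + ", orphan=" + replicator.isOrphan(topicClosed));
    }
}
```

In this model the retry succeeds because `Stopped` is indistinguishable from a recoverable producer failure, which is the ambiguity the state split is meant to remove.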
### Modifications
- Split the state `Replicator.State.Stopped` into `Producer_Stopped` and
`Closed`.
- The producer can be restarted after it was closed due to an error while
reading entries or acknowledging messages.
- The Replicator cannot be started again after it was closed due to the
topic being closed or replication being disabled.
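The split described above can be sketched as a small state machine. This is a minimal illustration under assumptions, not the PR's implementation: only the state names `Producer_Stopped` and `Closed` come from the description, while the class `SplitStateSketch`, its methods, and the use of `AtomicReference` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the split states; not the code changed by this PR.
final class SplitStateSketch {
    enum State { Producer_Stopped, Starting, Started, Closed }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.Producer_Stopped);

    // Starting is allowed only from Producer_Stopped, so a retry scheduled
    // before close() is rejected once the replicator is Closed.
    boolean startProducer() {
        if (!state.compareAndSet(State.Producer_Stopped, State.Starting)) {
            return false;
        }
        // Producer creation would happen here. If close() ran in the
        // meantime, this CAS fails and the new producer should be discarded.
        return state.compareAndSet(State.Starting, State.Started);
    }

    // Recoverable stop (read or ack error): a later restart is permitted.
    // A Closed replicator is never moved back to Producer_Stopped.
    void stopProducer() {
        state.updateAndGet(s -> s == State.Closed ? s : State.Producer_Stopped);
    }

    // Terminal stop: the topic is closing or replication was disabled.
    void close() {
        state.set(State.Closed);
    }

    State state() {
        return state.get();
    }

    public static void main(String[] args) {
        SplitStateSketch replicator = new SplitStateSketch();
        replicator.startProducer();
        replicator.stopProducer();
        System.out.println("restart after error: " + replicator.startProducer());
        replicator.close();
        System.out.println("restart after close: " + replicator.startProducer());
    }
}
```

Treating `Closed` as terminal means the delayed retry from Scenario 1 has no valid transition to take, at the cost of needing a fresh replicator object when replication is enabled again.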
Because the scenario is too complex, I could not add a test, but I reproduced
Scenario 1 locally.
<img width="1662" alt="Screenshot 2024-01-23 at 00 36 48"
src="https://github.com/apache/pulsar/assets/25195800/0e47b199-84a5-4d5f-8b93-81ff1bdf0b6b">
<img width="1664" alt="Screenshot 2024-01-23 at 00 35 33"
src="https://github.com/apache/pulsar/assets/25195800/523480d5-9c75-4013-80b1-8790bda8d389">
TODO:
- How to correctly start a new Replicator after replication has been
enabled again.
- How to correctly start a new Replicator when calling
`topic.unfenceTopicToResume` after `topic.close` failed.
### Documentation
<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
- [ ] `doc` <!-- Your PR contains doc changes. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update
later -->
- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
### Matching PR in forked repository
PR in forked repository: x