hangc0276 opened a new issue #13315:
URL: https://github.com/apache/pulsar/issues/13315
### Motivation
Geo-replication lets us support Pulsar cluster-level failover. We can set up
Pulsar cluster A as the primary cluster in data center A and Pulsar cluster B
as the backup cluster in data center B, then configure geo-replication between
them. All clients connect to the Pulsar cluster through DNS. If cluster A goes
down, the administrator switches the DNS record to point to cluster B instead
of cluster A. Once the clients resolve to cluster B, they can produce and
consume messages normally. After cluster A recovers, the administrator
switches the DNS back to cluster A.
However, the current method has two shortcomings.
1. The administrator has to monitor the status of all Pulsar clusters and
switch the DNS as soon as possible when cluster A goes down. Neither the switch
nor the recovery is automatic, and the recovery time depends on how quickly the
administrator reacts, which puts a heavy operational burden on the
administrator.
2. Both the Pulsar client and the DNS system cache resolved addresses. When the
administrator switches the DNS from cluster A to cluster B, it takes some time
for the cached entries to expire, which delays client recovery and causes
produce/consume failures in the meantime.
### Goal
We should provide an automatic cluster-level failure recovery mechanism to
make Pulsar cluster failover more effective. Pulsar clients should switch
automatically from cluster A to cluster B when they detect, according to the
configured detection policy, that cluster A is down, and switch back to
cluster A once it has recovered. Switching back matters because most
applications are likely deployed in data center A, where communicating with
Pulsar cluster A has a low network cost. If they keep using Pulsar cluster B,
they pay a higher network cost and see higher produce/consume latency.
To mitigate the DNS cache problem, we should also provide an
administrator-controlled switch provider that lets administrators update
service URLs.
In summary, we should provide two providers: an automatic service URL switch
provider and an administrator-controlled switch provider.
### Design
We have already provided the `ServiceUrlProvider` interface to support
different service URLs. To support automatic cluster-level failure recovery,
we can add new `ServiceUrlProvider` implementations. For the current
requirements, we can provide `AutoClusterFailover` and
`ControlledClusterFailover`.
#### AutoClusterFailover
To switch automatically from the primary cluster to the secondary, we can run
a probe task that checks the availability of both the primary and the
secondary cluster. When the probe finds that the primary cluster has been
failing for more than `failoverDelayMs`, it switches to the secondary cluster
by calling `updateServiceUrl`. After switching to the secondary cluster,
`AutoClusterFailover` keeps probing the primary cluster. If the primary
cluster comes back and remains active for `switchBackDelayMs`, it switches
back to the primary cluster.
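
The timing rules above can be sketched as a small state machine that is independent of the Pulsar client. This is a minimal sketch, not the proposed implementation: the class name `FailoverDecider`, the method `onProbe`, and the explicit `nowMs` argument are hypothetical, and it assumes that a failover only happens while the secondary cluster is itself reachable and that the delays are compared with `>=`.

```java
/** Hypothetical helper that decides which service URL the client should use. */
final class FailoverDecider {
    private final String primary;
    private final String secondary;
    private final long failoverDelayMs;
    private final long switchBackDelayMs;

    private String current;
    private long primaryDownSince = -1; // -1: primary is not in a failure streak
    private long primaryUpSince = -1;   // -1: primary is not in a recovery streak

    FailoverDecider(String primary, String secondary,
                    long failoverDelayMs, long switchBackDelayMs) {
        this.primary = primary;
        this.secondary = secondary;
        this.failoverDelayMs = failoverDelayMs;
        this.switchBackDelayMs = switchBackDelayMs;
        this.current = primary;
    }

    /** Feeds one probe result and returns the URL to use after this probe. */
    String onProbe(boolean primaryUp, boolean secondaryUp, long nowMs) {
        if (current.equals(primary)) {
            if (primaryUp) {
                primaryDownSince = -1; // failure streak broken
            } else {
                if (primaryDownSince < 0) {
                    primaryDownSince = nowMs;
                }
                // Assumption: only fail over if the secondary is reachable.
                if (secondaryUp && nowMs - primaryDownSince >= failoverDelayMs) {
                    current = secondary;
                    primaryDownSince = -1;
                    primaryUpSince = -1;
                }
            }
        } else {
            if (!primaryUp) {
                primaryUpSince = -1; // recovery streak broken
            } else {
                if (primaryUpSince < 0) {
                    primaryUpSince = nowMs;
                }
                if (nowMs - primaryUpSince >= switchBackDelayMs) {
                    current = primary;
                    primaryUpSince = -1;
                }
            }
        }
        return current;
    }
}
```

Passing the timestamp in, rather than reading the clock inside the class, keeps the delays easy to exercise in a unit test. A real provider would call `updateServiceUrl` whenever the returned URL changes.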
The APIs are listed as follows.
```Java
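import java.util.Timer;
import java.util.TimerTask;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.ServiceUrlProvider;

public class AutoClusterFailover implements ServiceUrlProvider {
    private final String primary;
    private final String secondary;
    private final long failoverDelayMs;
    private final long switchBackDelayMs;
    private final Timer timer = new Timer(true); // daemon timer for the probe task
    private volatile String currentPulsarServiceUrl;
    private PulsarClient pulsarClient;

    private AutoClusterFailover(String primary, String secondary,
                                long failoverDelayMs, long switchBackDelayMs) {
        this.primary = primary;
        this.secondary = secondary;
        this.failoverDelayMs = failoverDelayMs;
        this.switchBackDelayMs = switchBackDelayMs;
        this.currentPulsarServiceUrl = primary;
    }

    @Override
    public void initialize(PulsarClient client) {
        this.pulsarClient = client;
        // Probe every 30 seconds to check whether the primary cluster is active.
        this.timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Probe the clusters and call pulsarClient.updateServiceUrl(...)
                // once failoverDelayMs or switchBackDelayMs has elapsed.
            }
        }, 30_000, 30_000);
    }

    @Override
    public String getServiceUrl() {
        return this.currentPulsarServiceUrl;
    }

    @Override
    public void close() {
        this.timer.cancel();
    }

    // Probe whether the Pulsar cluster behind the given URL is available.
    private boolean probeAvailable(String url, int timeout) {
        // Body elided in this proposal.
        throw new UnsupportedOperationException("not yet implemented");
    }
}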
```
In the `probeAvailable` method, we will probe the Pulsar service port, and
check whether the port is open.
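
One way to check that a port is open is a plain TCP connect with a timeout. The following is a sketch under assumptions, not the proposed implementation: `PortProbe` is a hypothetical stand-alone class, the service URL is assumed to name a single `host:port` (a multi-host URL would need to be split first), and an open port only shows that something is listening, not that the broker is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

/** Hypothetical helper: checks whether the port named in a service URL accepts connections. */
final class PortProbe {
    private PortProbe() {
    }

    static boolean probeAvailable(String serviceUrl, int timeoutMs) {
        URI uri;
        try {
            uri = URI.create(serviceUrl);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        if (uri.getHost() == null || uri.getPort() < 0) {
            return false; // the sketch requires an explicit host and port
        }
        // Try a TCP connect; the socket is closed again right away.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(uri.getHost(), uri.getPort()), timeoutMs);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            return false; // refused, timed out, unresolved host, or invalid port
        }
    }
}
```

A short timeout keeps a hung data center from stalling the probe task, at the cost of occasional false negatives on a slow network. Requiring the failure to persist for `failoverDelayMs` is what absorbs those.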
#### ControlledClusterFailover
If users want to control the cluster switch themselves, they can publish the
current service URL through an HTTP service. `ControlledClusterFailover`
periodically fetches the newest service URL from that HTTP service.
The APIs are listed as follows.
```Java
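import java.io.IOException;
import java.util.Timer;
import java.util.TimerTask;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.ServiceUrlProvider;

public class ControlledClusterFailover implements ServiceUrlProvider {
    private final String urlProvider;
    private final Timer timer = new Timer(true); // daemon timer for the check task
    private volatile String currentPulsarServiceUrl;
    private PulsarClient pulsarClient;

    private ControlledClusterFailover(String defaultServiceUrl, String urlProvider)
            throws IOException {
        this.currentPulsarServiceUrl = defaultServiceUrl;
        this.urlProvider = urlProvider;
    }

    @Override
    public void initialize(PulsarClient client) {
        this.pulsarClient = client;
        // Check the service URL every 30 seconds.
        this.timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Call fetchServiceUrl() and, if the URL changed, switch with
                // pulsarClient.updateServiceUrl(...).
            }
        }, 30_000, 30_000);
    }

    private String fetchServiceUrl() throws IOException {
        // Body elided in this proposal: call the HTTP service at urlProvider.
        throw new UnsupportedOperationException("not yet implemented");
    }

    @Override
    public String getServiceUrl() {
        return this.currentPulsarServiceUrl;
    }

    @Override
    public void close() {
        this.timer.cancel();
    }
}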
```
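
The fetch-and-compare step could look roughly like the sketch below. The proposal does not specify the HTTP response format, so this assumes the service returns the service URL as a plain-text body. `ControlledSwitchSketch` and `nextServiceUrl` are hypothetical names, the scheme check is a deliberate simplification, and `java.net.http.HttpClient` requires Java 11 or later.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper for an administrator-controlled switch. */
final class ControlledSwitchSketch {
    private ControlledSwitchSketch() {
    }

    /** Fetches the raw response body from the URL provider (assumed plain text). */
    static String fetchServiceUrl(String urlProvider, Duration timeout)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(urlProvider))
                .timeout(timeout)
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("unexpected status " + response.statusCode());
        }
        return response.body();
    }

    /** Returns the URL to switch to, or empty if the client should stay where it is. */
    static Optional<String> nextServiceUrl(String current, String fetchedBody) {
        if (fetchedBody == null) {
            return Optional.empty();
        }
        String candidate = fetchedBody.trim();
        // Simplification: accept only the binary-protocol schemes.
        boolean looksValid = candidate.startsWith("pulsar://")
                || candidate.startsWith("pulsar+ssl://");
        if (!looksValid || candidate.equals(current)) {
            return Optional.empty();
        }
        return Optional.of(candidate);
    }
}
```

Validating the body before switching means an error page or an empty response from the HTTP service leaves the client on its current cluster instead of pointing it at a bogus URL.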
### API Changes
For the current `ServiceUrlProvider` interface, we should add a `close`
method to close an allocated resource, such as a timer thread.
```Java
public interface ServiceUrlProvider {
    // Existing methods (initialize, getServiceUrl) are unchanged and omitted here.

    /**
     * Close the resource that the provider allocated.
     */
    default void close() {
        // do nothing
    }
}
```
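
Because `close` is a `default` method, existing implementations keep compiling without changes, and only providers that own resources need to override it. The stand-alone sketch below illustrates this with a hypothetical mirror interface, `UrlProviderSketch`, and two hypothetical implementations. It does not use the real Pulsar types.

```java
import java.util.Timer;

/** Hypothetical stand-in for the provider interface, reduced to two methods. */
interface UrlProviderSketch {
    String getServiceUrl();

    default void close() {
        // do nothing
    }
}

/** A pre-existing style of provider: it never heard of close() and still compiles. */
final class StaticUrlProvider implements UrlProviderSketch {
    private final String url;

    StaticUrlProvider(String url) {
        this.url = url;
    }

    @Override
    public String getServiceUrl() {
        return url;
    }
}

/** A provider that owns a timer thread and releases it in close(). */
final class TimerBackedProvider implements UrlProviderSketch {
    private final Timer timer = new Timer(true);
    private final String url;
    private volatile boolean closed;

    TimerBackedProvider(String url) {
        this.url = url;
    }

    @Override
    public String getServiceUrl() {
        return url;
    }

    @Override
    public void close() {
        timer.cancel();
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}
```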
### Tests
Add tests for the two service URL provider implementations.
For `AutoClusterFailover`, when the primary cluster shuts down, the client
should switch to the secondary cluster. When the primary cluster comes back,
the client should switch back.
For `ControlledClusterFailover`, when the service URL is changed on the HTTP
service side, the client should switch to the newest service URL.