[GitHub] [incubator-seatunnel] ic4y opened a new issue, #2430: [ST-Engine][Design] How to deal with network partitions

GitBox Tue, 16 Aug 2022 04:03:24 -0700


ic4y opened a new issue, #2430:
URL: https://github.com/apache/incubator-seatunnel/issues/2430

### Search before asking

- [X] I had searched in the
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
and found no similar feature requirement.

### Description

Now the engine is implemented by hazelcast, and hazelcast has a split-brain
problem when the network is partitioned. How our engine solves this problem.

For example, there is currently a cluster of 5 nodes.

![image](https://user-images.githubusercontent.com/83933160/184854029-020efdc0-656c-4d1d-a4bc-fcfc85f5d106.png)

At this time, due to a network failure, the network is partitioned. Network
disconnection between nodes 1 2 and nodes 3 4 5,this creates two clusters
resulting in a split brain.

![image](https://user-images.githubusercontent.com/83933160/184854048-fa6e90ef-0d99-42e8-8e57-0f2e8af37c74.png)

This will cause the same task to be started in both clusters, which will be
problematic

Two solutions that come to mind：

1、Use the minimum number of running nodes
The cluster will have a minimum number of running nodes, which is the
total number of nodes/2+1. Only the number of cluster nodes is greater than
this number to run. This ensures that at most one cluster will function
properly in the event of a network partition.
A disadvantage of this is that if the number of running nodes is less
than 1/2 of the total number, it cannot continue to run. That is to say, if
there are 5 nodes, when more than 2 nodes fail, the entire cluster cannot
continue to run. And what we want is to work even if only one node is left.

2、Use an external file system
We will use an external file system to store checkpoint data in cluster
mode，So we can use an external filesystem。When a network partition occurs,
multiple clusters will be generated，At this time, each cluster reports its own
number of nodes to the file system. The file system selects the cluster with
the largest number of nodes to keep it running，and notify other clusters to
stop running. This also ensures that only one of the largest sub-clusters is
running in the event of a network partition.

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of
Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [incubator-seatunnel] ic4y opened a new issue, #2430: [ST-Engine][Design] How to deal with network partitions

Reply via email to