GitHub user andrewseif edited a discussion: GSoC: Support Seata Multi Raft

Hi Seata Community 👋


# GSoC: Support Seata Multi Raft

This is a working proposal for the GSoC project to add Multi-Raft capability to the Seata Server.

**Purpose:**

The current Apache Seata Server supports the Raft cluster mode, but the performance and throughput of the cluster are significantly limited because all traffic goes through the single leader of a single Raft group. The goal is therefore to extend the Seata Server to support Multi-Raft.

**Topics to cover:**

* What will this GSoC cover (acceptance / success criteria)? @funky-eyes will need to adjust this part.
* Transition from Single Raft to Multi-Raft.
* How will we maintain state in Multi-Raft?
* Raft config
* Raft group metadata & Raft group node addition
* Load balancing


---
**Transition from Single Raft to Multi-Raft:**

Instead of having one Raft group with a single leader that receives all traffic, we can run multiple Raft groups on the same number of nodes.

For the examples below, let's assume we have 3 physical nodes and a replication factor of $rf = 3$.

The current setup looks something like this, where multiple clients talk to the same leader, which can easily exhaust that leader's resources.

<img width="1284" alt="image" src="https://github.com/user-attachments/assets/fbe921b8-4599-41f8-b316-08e7d5b49b08" />

The proposed change would effectively make each node the leader of its own Raft group.

That means we can have 3 Raft groups on the same number of nodes instead of 1 Raft group.

Conceptually, it would look like this:

<img width="1111" alt="image1" src="https://github.com/user-attachments/assets/15b19662-da81-4355-8f44-db227453b697" />


Here multiple clients talk to different leaders, and requests are load-balanced across different nodes, which reduces the load on any single leader.
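For context, SOFAJRaft (the library Seata's Raft mode is built on) already allows several Raft groups to run on one node and share a single RPC server. Below is a minimal sketch of what starting multiple groups on one node could look like; the group IDs, addresses, storage paths, and the no-op state machine are placeholders, not the eventual Seata implementation.

```java
import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.RaftGroupService;
import com.alipay.sofa.jraft.conf.Configuration;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.entity.PeerId;
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.rpc.RaftRpcServerFactory;
import com.alipay.sofa.jraft.rpc.RpcServer;

// Sketch: start several Raft groups on one node, sharing a single RPC server.
public class MultiGroupNode {

    // Minimal state machine; a real one would apply Seata session/lock operations.
    static class NoopStateMachine extends StateMachineAdapter {
        @Override
        public void onApply(Iterator iter) {
            while (iter.hasNext()) {
                iter.next(); // consume each committed log entry
            }
        }
    }

    public static void main(String[] args) {
        PeerId serverId = PeerId.parsePeer("192.168.0.111:9091");
        Configuration initialConf = new Configuration();
        initialConf.parse("192.168.0.111:9091,192.168.0.112:9091,192.168.0.113:9091");

        // One RPC server shared by all Raft groups hosted on this node.
        RpcServer rpcServer = RaftRpcServerFactory.createRaftRpcServer(serverId.getEndpoint());
        rpcServer.init(null);

        for (String groupId : new String[] { "seata-rg1", "seata-rg2", "seata-rg3" }) {
            NodeOptions options = new NodeOptions();
            options.setInitialConf(initialConf);
            options.setFsm(new NoopStateMachine());
            options.setLogUri("raft/" + groupId + "/log");
            options.setRaftMetaUri("raft/" + groupId + "/meta");
            options.setSnapshotUri("raft/" + groupId + "/snapshot");

            // sharedRpcServer = true, and start(false) so the shared server is not re-initialised.
            RaftGroupService groupService =
                new RaftGroupService(groupId, serverId, options, rpcServer, true);
            Node node = groupService.start(false);
            System.out.println("started group " + groupId + ", node " + node.getNodeId());
        }
    }
}
```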

The next question is: how will we maintain state in Multi-Raft? How will Raft group 1 know about the locks held by Raft group 2, or whether it can acquire a lock or not?

The simple answer is that it doesn't, or rather it only needs to know about the currently held locks (the LOCK_MAP) and check whether a branch transaction would cause a conflict.

For that to work we will need a special Raft group that holds the LOCK_MAP. This group maintains the locks and needs to be "consulted" before any branch session acquires a lock: basically a call to that special group to make sure the requested lock is acquirable.

This also means that even though the branch sessions and the LOCK_MAP aren't synchronized across groups, it doesn't really matter, since the LOCK_MAP is what decides conflicts.
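As a rough sketch of that consult call (all type and method names here are hypothetical, not existing Seata APIs): before a Raft group persists a branch session, it asks the central group whether the row keys are free, and only proceeds on success.

```java
import java.util.List;

// Hypothetical interface for the "consult the central group" step.
// None of these types exist in Seata today; they only illustrate the flow.
public interface CentralLockService {

    /**
     * Ask the central group (the LOCK_MAP owner) whether the given row keys
     * can be locked for this transaction. The request is linearized through
     * the central group's Raft leader, so every ordinary group sees the same
     * LOCK_MAP.
     */
    boolean tryAcquire(String xid, long branchId, String resourceId, List<String> rowKeys);

    /** Release the locks once the branch / global transaction finishes. */
    void release(String xid, long branchId);
}
```

A branch-register call handled by any ordinary group would then only persist the branch session locally after `tryAcquire` returns true.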

Conceptually it would look like this:

<img width="1318" alt="image2" src="https://github.com/user-attachments/assets/df2979bb-c54c-4617-a4b8-f56206224c83" />

However, the actual physical implementation would look like this:

<img width="905" alt="image3" src="https://github.com/user-attachments/assets/726b3135-cc32-45ed-8458-8ce018c6b9b0" />

With this setup we can have 3 Raft groups on 3 nodes, plus the special group.

Now let's talk about that special group for a bit, and let's call it the central group, because, as the name implies, it will be responsible for managing the other Raft groups AND handling the LOCK_MAP.

A question might come up: why? Why do we need a central group? It feels like we are looping back to a single leader.

While this might appear true at first, it actually isn't. Think of it as a pattern similar to a database proxy: you are moving from an N:1 setup (N clients talking to 1 leader, with N unbounded) to a setup that is still N:1, but where N is small (3/5/7) and finite, since only the Raft group leaders, not every client, talk to the central group.

The second reason for having a central group is metadata and Raft group management. You might ask: why do we need to manage Raft groups, and what metadata?

The Raft groups are going to be the leaders the TM & RM talk to in order to start distributed transactions, and since we will have multiple groups, we will need a central group to manage them.

Think of it like the controller and broker setup in Kafka, where the controller is responsible for the metadata.

In our case the central group is responsible for the metadata and the LOCK_MAP, and the other Raft groups are your partitions (the brokers, the groups that do the actual work).

This also brings us to a big question: how will the system scale up and scale down?

Let's discuss scaling up first, since it is the easier one.

Following Kafka's approach to partitions, we can add Raft groups by assigning them to nodes in a round-robin fashion.

So let's say we have 4 physical nodes and 3 Raft groups with rf = 3 (plus the central one). For simplicity's sake we will have the central group on every node (that also means that when a failure happens, the failed fraction of the cluster is lower: with 3 nodes, 1 down is 33% of your cluster; with 4 nodes, 1 down is 25%, etc.).

RG1's leader will be on Node 1 and RG1's replicas on Nodes 2 & 3.

RG2's leader on Node 2 and RG2's replicas on Nodes 3 & 4.

RG3's leader on Node 3 and RG3's replicas on Nodes 4 & 1.

It's like having an array of nodes: you rotate over them one by one, and when you reach the end of the array you loop back to the start.

<img width="905" alt="image3" src="https://github.com/user-attachments/assets/99a065f2-eb0c-4398-8a94-530e72b0d5f4" />

If we get RG4, then RG4's leader will be on Node 4 and RG4's replicas on Nodes 1 & 2.

<img width="913" alt="image4" src="https://github.com/user-attachments/assets/2be482ab-3e1b-4c27-97e3-9a9509e1a7fe" />

Using round robin makes the scale-up process pretty deterministic, which is exactly what we need from a scale-up algorithm.

This will also make scaling down easier, since if you remove a leader you know exactly where to find its followers if needed.
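A minimal sketch of that round-robin placement (plain Java; the class name and node labels are made up for illustration), assuming a fixed node array and rf = 3:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that assigns a new Raft group to nodes round-robin.
// "nodes" is the ordered node array; groupIndex is 0 for RG1, 1 for RG2, ...
public final class RoundRobinPlacement {

    /** Returns [leaderNode, follower1, follower2, ...] for the given group. */
    public static List<String> place(List<String> nodes, int groupIndex, int rf) {
        List<String> placement = new ArrayList<>();
        for (int i = 0; i < rf; i++) {
            // Wrap around the node array when we run past the end.
            placement.add(nodes.get((groupIndex + i) % nodes.size()));
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-1", "node-2", "node-3", "node-4");
        // RG4 (index 3) with rf = 3 -> leader on node-4, replicas on node-1 and node-2.
        System.out.println(place(nodes, 3, 3));
    }
}
```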

Part of scaling up is also rebalancing. We won't discuss rebalancing here, since it is a whole other topic, but it will need to be covered in the future, especially if round-robin-style algorithms are used.

Why? Because any newly added node will host a new Raft group leader, but that group's followers will land on Nodes 1 & 2 (since we are looping over the node array), and at some point Nodes 1 and 2 will need to have groups "redistributed".

For scaling down:

As you are probably aware, Kafka does not support scaling down partitions (for obvious reasons), since doing it safely entails a lot of trouble and a LOT of edge cases; it would need its own project.

However, to make the solution at least partially complete, we can implement a naive approach to scaling down:

We can follow the same process Kubernetes uses when draining: stop traffic to the Raft group and shut it down after it finishes processing all its locks.

If we want to remove a node we would do the same: stop the traffic going to the Raft group leader on that node, wait until the group has processed all of its locks, then shut it down.

Obviously, if the number of nodes drops below 3, weird behaviour is to be expected, but aside from that, scaling up and down should be deterministic.
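A rough sketch of that drain-then-shutdown sequence (all names are hypothetical; a real implementation would hook into Seata's session manager and the group's Raft node):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical drain procedure for removing one Raft group from the cluster.
public final class RaftGroupDrainer {

    private final String groupId;
    private final CentralGroupClient centralGroup; // hypothetical client to the central group
    private final LockTracker lockTracker;         // hypothetical view of in-flight locks

    public RaftGroupDrainer(String groupId, CentralGroupClient centralGroup, LockTracker lockTracker) {
        this.groupId = groupId;
        this.centralGroup = centralGroup;
        this.lockTracker = lockTracker;
    }

    public void drainAndShutdown() throws InterruptedException {
        // 1. Tell the central group to stop routing new transactions to this group.
        centralGroup.markGroupDraining(groupId);

        // 2. Wait until every in-flight lock owned by this group has been released.
        while (lockTracker.activeLocks(groupId) > 0) {
            TimeUnit.SECONDS.sleep(1);
        }

        // 3. Only now is it safe to shut the group down and update the metadata.
        centralGroup.removeGroup(groupId);
    }

    // Hypothetical collaborators, declared here only so the sketch is self-contained.
    public interface CentralGroupClient {
        void markGroupDraining(String groupId);
        void removeGroup(String groupId);
    }

    public interface LockTracker {
        int activeLocks(String groupId);
    }
}
```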

Like rebalancing, this topic will need its own project after GSoC, since there are a lot of edge cases that need to be covered, e.g. what happens if another node crashes while you are scaling down, etc.

Now back to the final topic before moving on to the other parts, which is the central group:

We can either give it rf = 3 like any normal group and sacrifice some resilience, or place the central group on all nodes, but that means longer RPC/quorum round trips.

The controller/central group will also handle all calls related to metadata and thus will need to be discoverable by the TM and RM.

Also, why rf = 3 and not 5/7/11? For now let's keep it simple and stable; later we can increase rf and make it configurable by the end user, because each increase is a trade-off between resilience and speed.
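For reference, the resilience side of that trade-off is just standard Raft quorum arithmetic: with $rf$ replicas the group needs a majority to commit and can tolerate the rest failing, while every commit has to wait on that majority.

$$\text{quorum}(rf) = \left\lfloor \tfrac{rf}{2} \right\rfloor + 1, \qquad \text{tolerated failures}(rf) = \left\lfloor \tfrac{rf-1}{2} \right\rfloor$$

So $rf = 3$ tolerates 1 failure, $rf = 5$ tolerates 2, and $rf = 7$ tolerates 3, but each step up adds one more replica that every commit must be replicated to.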

---
**Raft Config:**

Right now we have the following config:

```
seata:
  server:
    raft:
      group: default # This value represents the group of this raft cluster, and the value corresponding to the client's transaction group should match it.
      server-addr: 192.168.0.111:9091,192.168.0.112:9091,192.168.0.113:9091 # IP and port of the 3 nodes, the port is the netty port of the node + 1000, default netty port is 8091
      snapshot-interval: 600 # Take a snapshot every 600 seconds for fast rolling of raftlog. However, making a snapshot every 600 seconds may cause business response time jitter if there is too much transaction data in memory. But it is friendly for fault recovery and faster node restart. You can adjust it to 30 minutes, 1 hour, etc., according to the business. You can test whether there is jitter on your own, and find a balance point between rt jitter and fault recovery.
      apply-batch: 32 # At most, submit raftlog once for 32 batches of actions
      max-append-bufferSize: 262144 # Maximum size of the log storage buffer, default is 256K
      max-replicator-inflight-msgs: 256 # In the case of enabling pipeline requests, the maximum number of in-flight requests, default is 256
      disruptor-buffer-size: 16384 # Internal disruptor buffer size. If it is a scenario with high write throughput, you need to appropriately increase this value. Default is 16384
      election-timeout-ms: 1000 # How long without a leader's heartbeat to start a new election
      reporter-enabled: false # Whether the monitoring of raft itself is enabled
      reporter-initial-delay: 60 # Interval of monitoring
      serialization: jackson # Serialization method, do not change
      compressor: none # Compression method for raftlog, such as gzip, zstd, etc.
      sync: true # Flushing method for raft log, default is synchronous flushing
  config:
    # support: nacos, consul, apollo, zk, etcd3
    type: file # This configuration can choose different configuration centers
  registry:
    # support: nacos, eureka, redis, zk, consul, etcd3, sofa
    type: file # Non-file registration center is not allowed in raft mode
  store:
    # support: file, db, redis, raft
    mode: raft # Use raft storage mode
    file:
      dir: sessionStore # This path is the storage location of raftlog and related transaction logs, default is relative path, it is better to set a fixed location
```
If we decide to go with a controller/central group with rf = 3, then we should add a "controller" configuration item to the configuration file, allowing us to configure which nodes will host the controller.

```
seata:
  server:
    raft:
      group: default # This value represents the group of this raft cluster, and the value corresponding to the client's transaction group should match it.
      server-addr: 192.168.0.111:9091,192.168.0.112:9091,192.168.0.113:9091 # IP and port of the 3 nodes, the port is the netty port of the node + 1000, default netty port is 8091
      controller: 192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091 # New item: nodes that host the controller/central group
      snapshot-interval: 600 # Take a snapshot every 600 seconds for fast rolling of raftlog. However, making a snapshot every 600 seconds may cause business response time jitter if there is too much transaction data in memory. But it is friendly for fault recovery and faster node restart. You can adjust it to 30 minutes, 1 hour, etc., according to the business. You can test whether there is jitter on your own, and find a balance point between rt jitter and fault recovery.
      apply-batch: 32 # At most, submit raftlog once for 32 batches of actions
      max-append-bufferSize: 262144 # Maximum size of the log storage buffer, default is 256K
      max-replicator-inflight-msgs: 256 # In the case of enabling pipeline requests, the maximum number of in-flight requests, default is 256
      disruptor-buffer-size: 16384 # Internal disruptor buffer size. If it is a scenario with high write throughput, you need to appropriately increase this value. Default is 16384
      election-timeout-ms: 1000 # How long without a leader's heartbeat to start a new election
      reporter-enabled: false # Whether the monitoring of raft itself is enabled
      reporter-initial-delay: 60 # Interval of monitoring
      serialization: jackson # Serialization method, do not change
      compressor: none # Compression method for raftlog, such as gzip, zstd, etc.
      sync: true # Flushing method for raft log, default is synchronous flushing
  config:
    # support: nacos, consul, apollo, zk, etcd3
    type: file # This configuration can choose different configuration centers
  registry:
    # support: nacos, eureka, redis, zk, consul, etcd3, sofa
    type: file # Non-file registration center is not allowed in raft mode
  store:
    # support: file, db, redis, raft
    mode: raft # Use raft storage mode
    file:
      dir: sessionStore # This path is the storage location of raftlog and related transaction logs, default is relative path, it is better to set a fixed location
```

Or, in the case of more nodes:

```
seata:
  server:
    raft:
      group: default # This value represents the group of this raft cluster, and the value corresponding to the client's transaction group should match it.
      server-addr: 192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091, 192.168.0.114:7091, 192.168.0.115:7091
      controller: 192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091 # New item: nodes that host the controller/central group
      snapshot-interval: 600 # Take a snapshot every 600 seconds for fast rolling of raftlog. However, making a snapshot every 600 seconds may cause business response time jitter if there is too much transaction data in memory. But it is friendly for fault recovery and faster node restart. You can adjust it to 30 minutes, 1 hour, etc., according to the business. You can test whether there is jitter on your own, and find a balance point between rt jitter and fault recovery.
      apply-batch: 32 # At most, submit raftlog once for 32 batches of actions
      max-append-bufferSize: 262144 # Maximum size of the log storage buffer, default is 256K
      max-replicator-inflight-msgs: 256 # In the case of enabling pipeline requests, the maximum number of in-flight requests, default is 256
      disruptor-buffer-size: 16384 # Internal disruptor buffer size. If it is a scenario with high write throughput, you need to appropriately increase this value. Default is 16384
      election-timeout-ms: 1000 # How long without a leader's heartbeat to start a new election
      reporter-enabled: false # Whether the monitoring of raft itself is enabled
      reporter-initial-delay: 60 # Interval of monitoring
      serialization: jackson # Serialization method, do not change
      compressor: none # Compression method for raftlog, such as gzip, zstd, etc.
      sync: true # Flushing method for raft log, default is synchronous flushing
  config:
    # support: nacos, consul, apollo, zk, etcd3
    type: file # This configuration can choose different configuration centers
  registry:
    # support: nacos, eureka, redis, zk, consul, etcd3, sofa
    type: file # Non-file registration center is not allowed in raft mode
  store:
    # support: file, db, redis, raft
    mode: raft # Use raft storage mode
    file:
      dir: sessionStore # This path is the storage location of raftlog and related transaction logs, default is relative path, it is better to set a fixed location
```

Also, as you can guess, we are using the shared setup (as Kafka calls it when the controller and brokers run on the same nodes), but we can always move the controller to separate nodes in the future if that improves performance.

This image shows the difference between the two [from Kafka](https://developer.confluent.io/courses/architecture/control-plane/):

![image6](https://github.com/user-attachments/assets/a662c57d-41ae-4bf3-b70f-f091ffea3f00)

---
**Raft Group Metadata:**

Mainly we will need to update the `ClusterController` class so it can return some additional metadata, such as:

* How many groups are on a specific node?
* How many Raft groups are active?
* Add a new config/node.
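A rough sketch of what those additions could look like (the endpoint paths, the `MultiRaftMetadataService`, and the request types are assumptions made for illustration; only `ClusterController` itself exists in Seata today):

```java
import java.util.List;
import java.util.Map;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical additions alongside Seata's existing ClusterController.
@RestController
public class MultiRaftMetadataController {

    private final MultiRaftMetadataService metadataService;

    public MultiRaftMetadataController(MultiRaftMetadataService metadataService) {
        this.metadataService = metadataService;
    }

    /** Which Raft groups are hosted on a specific node? */
    @GetMapping("/metadata/v1/node/groups")
    public List<String> groupsOnNode(@RequestParam("node") String nodeAddress) {
        return metadataService.groupsOnNode(nodeAddress);
    }

    /** How many Raft groups are currently active, and where are their leaders? */
    @GetMapping("/metadata/v1/groups/active")
    public Map<String, String> activeGroups() {
        return metadataService.activeGroupLeaders();
    }

    /** Register a new node / Raft group configuration with the central group. */
    @PostMapping("/metadata/v1/groups")
    public void addGroup(@RequestBody GroupConfig config) {
        metadataService.addGroup(config);
    }

    // Hypothetical collaborators, shown only to keep the sketch self-contained.
    public interface MultiRaftMetadataService {
        List<String> groupsOnNode(String nodeAddress);
        Map<String, String> activeGroupLeaders();
        void addGroup(GroupConfig config);
    }

    public record GroupConfig(String groupId, List<String> nodeAddresses) {}
}
```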

---
**Load balancing:**

We need to talk about load balancing to ensure optimal performance, but for now we can use the load-balancing strategies Seata already provides, and see whether we need to create Multi-Raft-specific ones in the future.
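As an illustration only, a Multi-Raft-aware strategy could simply round-robin across the known group leaders (plain Java, no Seata SPI assumed; the leader list would come from the central group's metadata):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical round-robin selection across Raft group leaders.
// In a real integration this would plug into the client's load-balancing mechanism.
public final class GroupLeaderRoundRobin {

    private final AtomicLong counter = new AtomicLong();

    /** Picks the next group leader address from the list published by the central group. */
    public String select(List<String> groupLeaders) {
        if (groupLeaders == null || groupLeaders.isEmpty()) {
            throw new IllegalStateException("no raft group leaders available");
        }
        int index = (int) (counter.getAndIncrement() % groupLeaders.size());
        return groupLeaders.get(index);
    }
}
```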

---
**List of future features we can add to Multi-Raft:**

* Improved load balancing / specialized algorithms
* Rebalancing (manual and automatic)
* Support for resiliency config or Kafka-like config
* Support for a higher replication factor (5/7/11)
* Improved scale-down algorithms
* Metrics and performance measurements across the different deployment types

GitHub link: https://github.com/apache/incubator-seata/discussions/7404
