GitHub user andrewseif edited a discussion: GSoC: Support Seata Multi Raft
Hi Seata Community 👋

# GSoC: Support Seata Multi Raft

This is a working proposal for the GSoC "Seata Server Multi-Raft" capability.

**Purpose:** The current Apache Seata Server supports the Raft cluster mode, but the performance and throughput of the cluster are significantly limited by the single leader of a single Raft group. The goal is therefore to extend Seata Server to support multi-Raft capability.

**Topics to cover:**

* What this GSoC will cover (acceptance/success criteria). @funky-eyes will need to adjust this part.
* Transition from single Raft to multi-Raft.
* How we will maintain state in multi-Raft.
* Raft config.
* Raft group metadata & Raft group node addition.
* Load balancing.

---

**Transition from Single Raft to Multi-Raft:**

Instead of having one Raft group and one single leader that receives all traffic, we can have multiple Raft groups on the same number of nodes. For the examples below, let's assume we have 3 physical nodes and $rf=3$.

The current setup looks something like this, where multiple clients talk to the same leader, which can easily exhaust the leader's resources.

<img width="1284" alt="image" src="https://github.com/user-attachments/assets/fbe921b8-4599-41f8-b316-08e7d5b49b08" />

The proposed change would effectively make each node the leader of its own Raft group. That means we can have 3 Raft groups on the same number of nodes instead of 1. Conceptually it would look like this:

<img width="1111" alt="image1" src="https://github.com/user-attachments/assets/15b19662-da81-4355-8f44-db227453b697" />

Multiple clients talk to different leaders, and requests are load balanced across different nodes, which reduces the load on any single leader.

The next question is: how will we maintain state in multi-Raft? How will Raft group 1 know about the locks held by Raft group 2, or whether it can acquire a lock? The simple answer is that it doesn't; it only needs to know about the currently held locks (the LOCK_MAP) and whether a branch transaction would cause a conflict. For that to work we need a special Raft group that holds the LOCK_MAP. This group maintains the locks and must be "consulted" before any branch session is locked: essentially a call to that special group to make sure the requested lock is acquirable. This also means that even though the branch sessions and the LOCK_MAP are not synchronized, it doesn't really matter, because what matters is the LOCK_MAP.

Conceptually it would look like this:

<img width="1318" alt="image2" src="https://github.com/user-attachments/assets/df2979bb-c54c-4617-a4b8-f56206224c83" />

However, the actual physical layout would look like this:

<img width="905" alt="image3" src="https://github.com/user-attachments/assets/726b3135-cc32-45ed-8458-8ce018c6b9b0" />

With this setup we can have 3 Raft groups on 3 nodes, plus the special group. Let's talk about that special group for a bit and call it the central group, because, as the name implies, it will be responsible for managing the other Raft groups AND handling the LOCK_MAP.

A question might come up: why do we need a central group at all? It feels like we are looping back to a single leader. While this might appear true at first, it actually is not. Think of it as a pattern similar to a database proxy: we are moving from an N:1 setup (N clients, potentially unbounded, all talking to 1 leader) to a setup that is still N:1, but where N is small and finite (3/5/7 Raft group leaders talking to the central group).
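To make the "consult the central group before locking" step concrete, here is a minimal sketch of how a worker Raft group leader could perform that check. All names (`BranchLockCoordinator`, `CentralGroupClient`, `tryAcquireBranchLock`, ...) are assumptions for illustration only; they are not existing Seata APIs.

```java
import java.util.List;

// Hypothetical sketch: a worker Raft group leader asks the central group
// whether a branch lock can be acquired before it locks the branch session.
// None of these types exist in Seata today; they only illustrate the flow.
public class BranchLockCoordinator {

    /** Minimal client for the central group's LOCK_MAP service (assumed interface). */
    public interface CentralGroupClient {
        /** Returns true if all row keys can be locked for the given transaction. */
        boolean tryAcquireBranchLock(String xid, String resourceId, List<String> rowKeys);
        void releaseBranchLock(String xid, String resourceId, List<String> rowKeys);
    }

    private final CentralGroupClient centralGroup;

    public BranchLockCoordinator(CentralGroupClient centralGroup) {
        this.centralGroup = centralGroup;
    }

    /**
     * Called on the worker group leader when registering a branch.
     * The branch session itself stays local to this Raft group; only the
     * global LOCK_MAP conflict check goes through the central group.
     */
    public boolean lockBranch(String xid, String resourceId, List<String> rowKeys) {
        // 1. Ask the central group whether the lock is acquirable (conflict check).
        boolean acquired = centralGroup.tryAcquireBranchLock(xid, resourceId, rowKeys);
        if (!acquired) {
            return false; // conflicts with a lock held via another Raft group
        }
        // 2. Replicate the branch session through this group's own Raft log
        //    (omitted here); if that replication fails, release the central lock again.
        return true;
    }
}
```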
The second reason for having a central group is metadata and Raft group management. Why do we need to manage Raft groups, and what metadata? The Raft groups are the leaders that TMs and RMs talk to in order to start distributed transactions, and since we will have multiple groups we need a central group to manage them. Think of it like the controller and broker setup in Kafka, where the controller is responsible for the metadata. In our case the central group is responsible for the metadata and the LOCK_MAP, and the other Raft groups are your partitions (the brokers, i.e. the groups that do the actual work).

This brings us to a big question: how will the system scale up and scale down?

Let's discuss scaling up first, since it is the easier one. Following Kafka partitions, we can add Raft groups (our equivalent of partitions) using round-robin placement (see the placement sketch at the end of this section). Say we have 4 physical nodes and 3 Raft groups with rf = 3 (plus the central one). For simplicity's sake we put the central group on every node (which also means that when a failure happens, the failure percentage is lower: with 3 nodes, 1 node down is 33% of your cluster; with 4 nodes, 1 node down is 25%, and so on).

* The RG1 leader will be on Node 1 and the RG1 replicas on Nodes 2 & 3.
* The RG2 leader on Node 2 and the RG2 replicas on Nodes 3 & 4.
* The RG3 leader on Node 3 and the RG3 replicas on Nodes 4 & 1.

It is like having an array of nodes: you rotate over them one by one, and when you reach the end of the array you loop back to the start.

<img width="905" alt="image3" src="https://github.com/user-attachments/assets/99a065f2-eb0c-4398-8a94-530e72b0d5f4" />

If we add RG4, its leader will be on Node 4 and its replicas on Nodes 1 & 2.

<img width="913" alt="image4" src="https://github.com/user-attachments/assets/2be482ab-3e1b-4c27-97e3-9a9509e1a7fe" />

Using round robin makes the scale-up process deterministic, which is exactly what we need from a scale-up algorithm. It also makes scaling down easier, since if you remove a leader you immediately know where to find its followers.

Part of scaling up is rebalancing. We won't discuss rebalancing here, since it is a whole other topic, but it will need to be covered in the future, especially when using round-robin style algorithms. Why? Because any new node added will host the leader of a new Raft group, but that group's followers will land on Nodes 1 & 2 (since we are looping over the node array), and at some point Nodes 1 and 2 will need to have groups "redistributed".

For scaling down: as you are well aware, there is no scaling down of Kafka partitions (for obvious reasons), since it causes a lot of trouble and needs to cover a LOT of edge cases; it would need its own project. However, to make the solution partially complete, we can implement a naive approach to scaling down, following the same process Kubernetes uses: stop traffic to the Raft group, then shut it down after it finishes processing all of its locks. If we want to remove a node, we do the same: stop the traffic going to the Raft group leader on that node, wait until the group has processed all its locks, then shut it down. Obviously, if the number of nodes drops below 3, odd behaviour is to be expected, but aside from that, scaling up and down should be deterministic. Like rebalancing, scaling down will need its own project after GSoC, since there are a lot of edge cases to cover, e.g. what happens if another node crashes while you are scaling down? etc.
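A minimal sketch of the round-robin placement described above, under the assumption of a fixed replication factor. `RaftGroupPlacement` and its method names are made up for illustration; the output matches the RG1-RG4 example in the text.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of round-robin placement of Raft groups over nodes.
 * Group i's leader goes on node (i mod N); its rf-1 followers go on the
 * next nodes, wrapping around the node array. Names are hypothetical.
 */
public class RaftGroupPlacement {

    /**
     * Returns the 1-based node numbers hosting the given group:
     * the leader first, then the followers.
     */
    public static List<Integer> place(int groupIndex, int nodeCount, int replicationFactor) {
        List<Integer> members = new ArrayList<>();
        for (int r = 0; r < replicationFactor; r++) {
            members.add((groupIndex + r) % nodeCount + 1); // wrap around the node array
        }
        return members;
    }

    public static void main(String[] args) {
        int nodes = 4, rf = 3;
        for (int g = 0; g < 4; g++) {
            List<Integer> members = place(g, nodes, rf);
            // e.g. "RG1 -> leader on Node 1, replicas on Nodes [2, 3]"
            System.out.printf("RG%d -> leader on Node %d, replicas on Nodes %s%n",
                    g + 1, members.get(0), members.subList(1, members.size()));
        }
    }
}
```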
Now, back to the final topic before moving on to the other parts: the central group. We can either give it rf = 3 like any normal group and sacrifice some resilience, or place the central group on all nodes, which means longer RPC/quorum times. The controller/central group will also handle all calls related to metadata and will therefore need to be identifiable by the TM and RM. Why rf = 3 and not 5/7/11? For now let's keep it simple and stable; later we can increase rf and make it configurable by the end user, because each increase trades speed for resilience.

---

**Raft Config:**

Right now we have the following config:

```
seata:
  server:
    raft:
      group: default # This value represents the group of this raft cluster, and the value corresponding to the client's transaction group should match it.
      server-addr: 192.168.0.111:9091,192.168.0.112:9091,192.168.0.113:9091 # IP and port of the 3 nodes; the port is the netty port of the node + 1000, default netty port is 8091
      snapshot-interval: 600 # Take a snapshot every 600 seconds for fast rolling of the raft log. Making a snapshot every 600 seconds may cause business response time jitter if there is too much transaction data in memory, but it is friendly for fault recovery and faster node restart. You can adjust it to 30 minutes, 1 hour, etc., according to the business; test whether there is jitter and find a balance point between RT jitter and fault recovery.
      apply-batch: 32 # Submit the raft log at most once per 32 batches of actions
      max-append-bufferSize: 262144 # Maximum size of the log storage buffer, default is 256K
      max-replicator-inflight-msgs: 256 # With pipelined requests enabled, the maximum number of in-flight requests, default is 256
      disruptor-buffer-size: 16384 # Internal disruptor buffer size. For scenarios with high write throughput, increase this value appropriately. Default is 16384
      election-timeout-ms: 1000 # How long without a leader's heartbeat before starting a new election
      reporter-enabled: false # Whether raft's own monitoring is enabled
      reporter-initial-delay: 60 # Monitoring interval
      serialization: jackson # Serialization method, do not change
      compressor: none # Compression method for the raft log, such as gzip, zstd, etc.
      sync: true # Flushing method for the raft log, default is synchronous flushing
  config: # support: nacos, consul, apollo, zk, etcd3
    type: file # This configuration can choose different configuration centers
  registry: # support: nacos, eureka, redis, zk, consul, etcd3, sofa
    type: file # Non-file registries are not allowed in raft mode
  store: # support: file, db, redis, raft
    mode: raft # Use raft storage mode
    file:
      dir: sessionStore # This path is the storage location of the raft log and related transaction logs; it defaults to a relative path, so it is better to set a fixed location
```

If we decide to go with a controller/central group with rf = 3, then we should add a "controller" configuration item to the configuration file, allowing us to configure which nodes will host the controller.

```
seata:
  server:
    raft:
      group: default # This value represents the group of this raft cluster, and the value corresponding to the client's transaction group should match it.
      server-addr: 192.168.0.111:9091,192.168.0.112:9091,192.168.0.113:9091 # IP and port of the 3 nodes; the port is the netty port of the node + 1000, default netty port is 8091
      controller: 192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091 # (proposed) nodes that host the controller/central group
      snapshot-interval: 600 # Take a snapshot every 600 seconds for fast rolling of the raft log. Making a snapshot every 600 seconds may cause business response time jitter if there is too much transaction data in memory, but it is friendly for fault recovery and faster node restart. You can adjust it to 30 minutes, 1 hour, etc., according to the business; test whether there is jitter and find a balance point between RT jitter and fault recovery.
      apply-batch: 32 # Submit the raft log at most once per 32 batches of actions
      max-append-bufferSize: 262144 # Maximum size of the log storage buffer, default is 256K
      max-replicator-inflight-msgs: 256 # With pipelined requests enabled, the maximum number of in-flight requests, default is 256
      disruptor-buffer-size: 16384 # Internal disruptor buffer size. For scenarios with high write throughput, increase this value appropriately. Default is 16384
      election-timeout-ms: 1000 # How long without a leader's heartbeat before starting a new election
      reporter-enabled: false # Whether raft's own monitoring is enabled
      reporter-initial-delay: 60 # Monitoring interval
      serialization: jackson # Serialization method, do not change
      compressor: none # Compression method for the raft log, such as gzip, zstd, etc.
      sync: true # Flushing method for the raft log, default is synchronous flushing
  config: # support: nacos, consul, apollo, zk, etcd3
    type: file # This configuration can choose different configuration centers
  registry: # support: nacos, eureka, redis, zk, consul, etcd3, sofa
    type: file # Non-file registries are not allowed in raft mode
  store: # support: file, db, redis, raft
    mode: raft # Use raft storage mode
    file:
      dir: sessionStore # This path is the storage location of the raft log and related transaction logs; it defaults to a relative path, so it is better to set a fixed location
```
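To give a feel for how the proposed (not yet existing) `controller` entry could be consumed on the server side, here is a small, self-contained sketch that parses the comma-separated controller address list. The property value, class, and method names are assumptions for illustration, not existing Seata code.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper that parses the proposed
 * seata.server.raft.controller entry (comma-separated host:port list)
 * into socket addresses the server could use to locate the central group.
 */
public final class ControllerAddressParser {

    public static List<InetSocketAddress> parse(String controllerProperty) {
        List<InetSocketAddress> result = new ArrayList<>();
        if (controllerProperty == null || controllerProperty.isEmpty()) {
            return result; // no central group configured: keep single-raft behaviour
        }
        for (String entry : controllerProperty.split(",")) {
            String[] hostPort = entry.trim().split(":");
            result.add(new InetSocketAddress(hostPort[0], Integer.parseInt(hostPort[1])));
        }
        return result;
    }

    public static void main(String[] args) {
        // Example value taken from the config sample above.
        String value = "192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091";
        parse(value).forEach(a -> System.out.println(a.getHostString() + " -> " + a.getPort()));
    }
}
```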
The same configuration, in the case of more nodes, would look like this:

```
seata:
  server:
    raft:
      group: default # This value represents the group of this raft cluster, and the value corresponding to the client's transaction group should match it.
      server-addr: 192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091, 192.168.0.114:7091, 192.168.0.115:7091
      controller: 192.168.0.111:7091, 192.168.0.112:7091, 192.168.0.113:7091 # (proposed) nodes that host the controller/central group
      snapshot-interval: 600 # Take a snapshot every 600 seconds for fast rolling of the raft log. Making a snapshot every 600 seconds may cause business response time jitter if there is too much transaction data in memory, but it is friendly for fault recovery and faster node restart. You can adjust it to 30 minutes, 1 hour, etc., according to the business; test whether there is jitter and find a balance point between RT jitter and fault recovery.
      apply-batch: 32 # Submit the raft log at most once per 32 batches of actions
      max-append-bufferSize: 262144 # Maximum size of the log storage buffer, default is 256K
      max-replicator-inflight-msgs: 256 # With pipelined requests enabled, the maximum number of in-flight requests, default is 256
      disruptor-buffer-size: 16384 # Internal disruptor buffer size. For scenarios with high write throughput, increase this value appropriately. Default is 16384
      election-timeout-ms: 1000 # How long without a leader's heartbeat before starting a new election
      reporter-enabled: false # Whether raft's own monitoring is enabled
      reporter-initial-delay: 60 # Monitoring interval
      serialization: jackson # Serialization method, do not change
      compressor: none # Compression method for the raft log, such as gzip, zstd, etc.
      sync: true # Flushing method for the raft log, default is synchronous flushing
  config: # support: nacos, consul, apollo, zk, etcd3
    type: file # This configuration can choose different configuration centers
  registry: # support: nacos, eureka, redis, zk, consul, etcd3, sofa
    type: file # Non-file registries are not allowed in raft mode
  store: # support: file, db, redis, raft
    mode: raft # Use raft storage mode
    file:
      dir: sessionStore # This path is the storage location of the raft log and related transaction logs; it defaults to a relative path, so it is better to set a fixed location
```

Also, as you can guess, we are using the shared setup (as Kafka calls it when the controller and brokers run on the same nodes), but we can always move the controller to separate nodes in the future if that improves performance. This image shows the difference between the two [from Kafka](https://developer.confluent.io/courses/architecture/control-plane/).

---

**Raft Group Metadata:**

Mainly we will need to update the `ClusterController` class so it can return a few additional pieces of metadata, such as (see the sketch below):

* How many groups are on a specific node?
* How many active Raft groups are there?
* Add a new config/node.
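A rough sketch of the kind of additional metadata queries described above. This is not the existing `ClusterController`; the request mappings, return types, and the `MultiRaftMetadataService` component are all assumptions used only to make the three bullet points concrete.

```java
import java.util.List;
import java.util.Map;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

/**
 * Hypothetical sketch of the extra cluster-metadata queries the central
 * group could answer. MultiRaftMetadataService is an assumed component
 * backed by the central group's replicated state machine.
 */
@RestController
@RequestMapping("/metadata/v1/multi-raft")
public class MultiRaftMetadataController {

    private final MultiRaftMetadataService metadataService;

    public MultiRaftMetadataController(MultiRaftMetadataService metadataService) {
        this.metadataService = metadataService;
    }

    /** How many Raft groups (leader or follower) live on a specific node? */
    @GetMapping("/groups-on-node")
    public int groupsOnNode(@RequestParam("node") String nodeAddress) {
        return metadataService.countGroupsOnNode(nodeAddress);
    }

    /** Which Raft groups are currently active in the cluster? */
    @GetMapping("/active-groups")
    public List<String> activeGroups() {
        return metadataService.listActiveGroups();
    }

    /** Register a new node and let the central group place Raft groups on it. */
    @PostMapping("/nodes")
    public Map<String, Object> addNode(@RequestParam("address") String nodeAddress) {
        return metadataService.addNode(nodeAddress);
    }

    /** Assumed service reading the central group's metadata. */
    public interface MultiRaftMetadataService {
        int countGroupsOnNode(String nodeAddress);
        List<String> listActiveGroups();
        Map<String, Object> addNode(String nodeAddress);
    }
}
```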
---

**Load balancing:**

We need to talk about load balancing to ensure optimal performance, but for now we can use the load balancers that are already available, and later evaluate whether multi-Raft needs new, specialized ones.

---

**List of future features we can add to multi-Raft:**

* Improved load balancing / specialized algorithms
* Rebalancing (manual and automatic)
* Support for resiliency config or Kafka-like config
* Support for higher replication factors (5/7/11)
* Improved scale-down algorithms
* Metrics and performance measurements comparing the different deployment types

GitHub link: https://github.com/apache/incubator-seata/discussions/7404