lizhimins opened a new issue, #4778:
URL: https://github.com/apache/rocketmq/issues/4778

   # Design of log replication transfer protocol
   
   In a distributed system, broker or namesrv nodes may go down, so we 
introduce a dynamic multiple replica architecture to achieve high availability 
at the application service level. In the traditional active-standby mode, the 
two replicas cannot achieve majority election and do not have the ability to 
switch by itself. However, the cost of three replicas required by Raft is 
relatively high, the storage layer does not reuse the persistence model of 
RocketMQ's native implementation, and there are some problems such as 
inflexible message confirmation. In the design of “RIP 44”, it has been 
proposed to implement the DLedger Controller consistency agreement on 
NameServer https://shimo.im/docs/N2A1Mz9QZltQZoAD.
   
   The proposal has the following core elements:
   
   1. Multiple DLedger Controllers elect the master through consensus protocol.
   
      1. The assignment of the identity of a single broker replica group is 
completed by the master controller (BrokerId = 0 means the master)
      2. For the non-switching version, you only need to control the controller 
not to elect a new master. It is controlled by the parameter allowElectMaster, 
which realizes the dynamic switching between the selected master and the 
non-selected master version.
   
   2. The Broker obtains the replica group information of the BrokerMemberGroup 
from the Controller by means of event notification, polling and redirection, 
and modifies its own identity.
   
   3. In the traditional architecture, the lease mechanism is generally used to 
avoid multi-master situation.
   
      Considering the network partition, the Controller is unavailable but not 
actively downgraded. In this design, without lease mechanism is not adopted, 
and dual masters cannot be avoided in the case of network partitions.
   
      For example, if the old master is isolated by the network, and 
allAckInSyncStateSet / minInSyncReplica is not configured, the message will 
still be written successfully, and the master and slave are not synchronized.
   
   4. Use Epoch-StartOffset and MaxOffset to determine lastConsistentPoint 
(Offset) and maintain CommitLog and EpochMap
   
   ## Transfer frame format
   
   The newly implemented log service protocol communicates based on frames in 
the format of
   
   frame length (int) + message type (int) + write timestamp (long) + epoch 
(long) + payload (byte[], length = total-24) 
   
   
![image-20220804143508950](https://user-images.githubusercontent.com/22487634/182796635-e1c1280a-2d5f-456f-baa7-b829417d63c5.png)
   
   ## Communication protocol and behavior
   
   ### 1. Five states of the state machine
   
   They are Ready - Handshake - Transfer - Suspend - Shutdown
   
   **Ready phase**: After recover the commitLog, consumeQueue, TimerLog and 
other states, broker will enter the ready state.
   
   At this time, the broker obtains the BrokerMemberGroup from controller, and 
will establish a TCP connection to the master, and the protocol negotiation 
will be completed at this stage.
   
   **HandShake phase**: The master and slave determine the fork point through 
the Epoch-StartOffset mechanism, and determine the replication start point 
according to the configuration to ensure data consistency between the master 
and the slave.
   
   **Transfer stage**: In the normal data transfer stage, the master pushes the 
part commitlog (block data) to the slave, and the slave will reply the current 
processing point to the master to calculate confirm offset
   
   **Suspend phase**: The master and slave are working online at the same time, 
and the heartbeat is maintained in the connection.
   
   **Shutdown phase**: Once process shutdown, HA-related services 
(GroupTransfer HaNotified, etc.) will graceful shutdown;
   
   
![image-20220804143508950](https://user-images.githubusercontent.com/22487634/182796695-7545495e-a01d-4cda-8b57-93cdddce3f5c.png)
   
   ### 2. Communication Protocol
   
   Divided into 3 groups, HA negotiation, offset negotiation and transmission, 
a total of 7 Request Response Code
   
   #### 1. HA negotiation phase
   
   Code: HANDSHAKE_SLAVE - HANDSHAKE_MASTER
   
   The standby initiates a HANDSHAKE_SLAVE request to the master to add itself 
as the slave to obtain the master's data.
   
   1. In the non-elect version, the brokerId will not unchanged, and the slave 
brokerPerm is read-only.
   
   2. In selecting the master version, the brokerId will not be repeated for 
the master because the controller assigns the brokerId.
   
   CheckResult indicates whether the master accepts the slave to establish a 
connection, which is an enumerated value.
   
   If master accept to establish a connection, it is Success. Or the reason for 
disagreement may be ha protocol not supported, identity error such as incorrect 
brokerName. 
   
   #### 2. Offset Negotiation Phase
   
   Code: QUERY_EPOCH - RETURN_EPOCH - CONFIRM_TRUNCATE
   
   Content: The slave broker queries the epoch to the master, the master broker 
returns Map<epoch, start offset>, and the slave confirms to the primary after 
completing the data correction
   
   #### 3. Log copy and transfer phase
   
   Code: TRANSFER_DATA - TRANSFER_ACK
   
   Content: currentBlockEpoch, currentEpochStartOffset, 
currentBlockStartOffset, BlockStartOffset, payload
   
   The master sends the epoch information of the current block, the starting 
point, the current confirmation point and other information to the slave, and 
the slave returns the currently received point after receiving it.
   
   The master will calculate the confirm offset of the replica group according 
to the ack information of the standby node and the configuration of message 
confirmation (minInSyncReplica).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to