This is an automated email from the ASF dual-hosted git repository.
jinrongtong pushed a commit to branch develop
in repository https://gitbox.apache.org/repos/asf/rocketmq.git
The following commit(s) were added to refs/heads/develop by this push:
new dd4d188bd [ISSUE #5789]Translation controller design.md chinese
document to english document (#5790)
dd4d188bd is described below
commit dd4d188bd1efe863a9c39b551a931b52d2886dab
Author: mxsm <[email protected]>
AuthorDate: Sat Dec 31 11:42:41 2022 +0800
[ISSUE #5789]Translation controller design.md chinese document to english
document (#5790)
---
docs/en/controller/design.md | 192 +++++++++++++++++++++
docs/en/image/controller/controller_design_1.png | Bin 0 -> 427708 bytes
docs/en/image/controller/controller_design_3.png | Bin 0 -> 77603 bytes
docs/en/image/controller/controller_design_4.png | Bin 0 -> 56097 bytes
docs/en/image/controller/controller_design_5.png | Bin 0 -> 120030 bytes
docs/en/image/controller/controller_design_6.png | Bin 0 -> 13397 bytes
.../en/image/controller/quick-start/controller.png | Bin 0 -> 268136 bytes
7 files changed, 192 insertions(+)
diff --git a/docs/en/controller/design.md b/docs/en/controller/design.md
new file mode 100644
index 000000000..ba2de58af
--- /dev/null
+++ b/docs/en/controller/design.md
@@ -0,0 +1,192 @@
+# Background
+
+In the current RocketMQ Raft mode, the DLedger Commitlog is mainly used to
replace the original Commitlog, enabling the Commitlog to have the ability to
elect and replicate. However, this also causes some problems:
+
+- In the Raft mode, the number of replicas within the Broker group must be
three or more, and the ACK of the replicas must also follow the majority
protocol.
+- RocketMQ has two sets of HA replication processes, and the replication in
Raft mode cannot utilize RocketMQ's native storage capability.
+
+Therefore, we hope to use DLedger to implement a consistency module (DLedger
Controller) based on Raft, and use it as an optional leader election component.
It can be deployed independently or embedded in the Nameserver. The Broker
completes the election of the Master through interaction with the Controller,
thus solving the above problems. We refer to this new mode as the Controller
mode.
+
+# Architecture
+
+### Core idea
+
+
+
+- The following is a description of the core architecture of the Controller
mode, as shown in the figure:
+ - DledgerController: Using DLedger, a DLedger controller that ensures the
consistency of metadata is constructed. The Raft election will select an Active
DLedger Controller as the main controller. The DLedger Controller can be
embedded in the Nameserver or deployed independently. Its main function is to
store and manage the SyncStateSet list of Brokers, and actively issue
scheduling instructions to switch the Master of the Broker when the Master of
the Broker is offline or network isolated.
+ - SyncStateSet: Mainly represents a set of Slave replicas following the
Master in a broker replica group, with the main criterion for judgment being
the gap between the Master and the Slave. When the Master is offline, we will
select a new Master from the SyncStateSet list. The SyncStateSet list is mainly
initiated by the Master Broker. The Master completes the Shrink and Expand of
the SyncStateSet through a periodic task to determine and synchronize the
SyncStateSet, and initiates an [...]
+ - AutoSwitchHAService: A new HAService that, based on DefaultHAService,
supports the switching of BrokerRole and the mutual conversion between Master
and Slave (under the control of the Controller). In addition, this HAService
unifies the log replication process and truncates the logs during the HA
HandShake stage.
+ - ReplicasManager: As an intermediate component, it serves as a link between
the upper and lower levels. Upward, it can regularly synchronize control
instructions from the Controller, and downward, it can regularly monitor the
state of the HAService and modify the SyncStateSet at the appropriate time. The
ReplicasManager regularly synchronizes metadata about the Broker from the
Controller, and when the Controller elects a new Master, the ReplicasManager
can detect the change in metadat [...]
+
+## DLedgerController core design
+
+
+
+- The following is a description of the core design of the DLedgerController:
+ - DLedgerController can be embedded in Namesrv or deployed independently.
+ - Active DLedgerController is the Leader elected by DLedger. It will accept
event requests from clients and initiate consensus through DLedger, and finally
apply them to the in-memory metadata state machine.
+ - Not Active DLedgerController, also known as the Follower role, will
replicate the event logs from the Active DLedgerController through DLedger and
then apply them directly to the state machine.
+
+## Log replication
+
+### Basic concepts and processes
+
+In order to unify the log replication process, distinguish the log replication
boundary of each Master, and facilitate log truncation, the concept of
MasterEpoch is introduced, which represents the current Master's term number
(similar to the meaning of Raft Term).
+
+For each Master, it has a MasterEpoch and a StartOffset, which respectively
represent the term number and the starting log offset of the Master.
+
+It should be noted that the MasterEpoch is determined by the Controller and is
monotonically increasing.
+
+In addition, we have introduced the EpochFile, which is used to store the
\<Epoch, StartOffset\> sequence.
+
+**When a Broker becomes the Master, it will:**
+
+- Truncate the Commitlog to the boundary of the last message.
+- Persist the latest \<MasterEpoch, startoffset\> to the EpochFile, where
startOffset is the current CommitLog's MaxPhyOffset.
+- Then the HAService listens for connections and creates the HAConnection to
interact with the Slave to complete the process.
+
+**When a Broker becomes the Slave, it will:**
+
+Ready stage:
+
+- Truncate the Commitlog to the boundary of the last message.
+- Establish a connection with the Master.
+
+Handshake stage:
+
+- Perform log truncation, where the key is for the Slave to compare its local
epoch and startOffset with the Master to find the log truncation point and
perform log truncation.
+
+Transfer stage:
+
+- Synchronize logs from the Master.
+
+### Truncation algorithm
+
+The specific log truncation algorithm flow is as follows:
+
+- During the Handshake stage, the Slave obtains the Master's EpochCache from
the Master.
+- The Slave compares the obtained Master EpochCache \<Startoffset,
Endoffset\>, and compares them with the local cache from back to front. If the
Epoch and StartOffset of the two are equal, the Epoch is valid, and the
truncation point is the smaller Endoffset between them. After truncation, the
\<Epoch, Startoffse\t> information is corrected and enters the Transfer stage.
If they are not equal, the previous epoch of the Slave is compared until the
truncation point is found.
+
+```java
+slave:TreeMap<Epoch, Pair<startOffset,endOffset>> epochMap;
+Iterator iterator = epochMap.entrySet().iterator();
+truncateOffset = -1;
+
+//The epochs are sorted from largest to smallest
+while (iterator.hasNext()) {
+ Map.Entry<Epoch, Pair<startOffset,endOffset>> curEntry = iterator.next();
+ Pair<startOffset,endOffset> masterOffset=
+ findMasterOffsetByEpoch(curEntry.getKey());
+
+ if(masterOffset != null &&
+ curEntry.getKey().getObejct1() == masterOffset.getObejct1()) {
+ truncateOffset = Math.min(curEntry.getKey().getObejct2(),
masterOffset.getObejct2());
+ break;
+ }
+}
+```
+
+### Replication process
+
+Since HA replicates logs based on stream, we cannot distinguish the boundaries
of the logs (that is, a batch of transmitted logs may span multiple
MasterEpochs), and the Slave cannot detect changes in MasterEpoch and cannot
timely modify EpochFile.
+
+Therefore, we have made the following improvements:
+
+When the Master transfers logs, it ensures that a batch of logs sent at a time
is in the same epoch, but not spanning multiple epochs. We can add two
variables in WriteSocketService:
+
+- currentTransferEpoch: represents which epoch
WriteSocketService.nextTransferFromWhere belongs to
+- currentTransferEpochEndOffset: corresponds to the end offset of
currentTransferEpoch. If currentTransferEpoch == MaxEpoch, then
currentTransferEpochEndOffset= -1, indicating no boundary.
+
+When WriteSocketService transfers the next batch of logs (assuming the total
size of this batch is size), if it finds that nextTransferFromWhere + size >
currentTransferEpochEndOffset, it sets selectMappedBufferResult limit to
currentTransferEpochEndOffset. Finally, modify currentTransferEpoch and
currentTransferEpochEndOffset to the next epoch.
+
+Correspondingly, when the Slave receives logs, if it finds a change in epoch
from the header, it records it in the local epoch file.
+
+### Replication protocol
+
+According to the above, we can know the AutoSwitchHaService protocol divides
log replication into multiple stages. Below is the protocol for the HaService.
+
+#### Handshake stage
+
+1.AutoSwitchHaClient (Slave) will send a HandShake packet to the Master as
follows:
+
+
+
+`current state(4byte) + Two flags(4byte) + slaveAddressLength(4byte) +
slaveAddress(50byte)`
+
+- `Current state` represents the current HAConnectionState, which is HANDSHAKE.
+
+- Two flags are two status flags, where `isSyncFromLastFile` indicates whether
to start copying from the Master's last file, and `isAsyncLearner` indicates
whether the Slave is an asynchronous copy and joins the Master as a Learner.
+
+- `slaveAddressLength` and `slaveAddress` represent the address of the Slave,
which will be used later to join the SyncStateSet.
+
+2.AutoSwitchHaConnection (Master) will send a HandShake packet back to the
Slave as follows:
+
+
+
+`current state(4byte) + body size(4byte) + offset(8byte) + epoch(4byte) + body`
+
+- `Current state` represents the current HAConnectionState, which is HANDSHAKE.
+- `Body size` represents the length of the body.
+- `Offset` represents the maximum offset of the log on the Master side.
+- `Epoch` represents the Master's Epoch.
+- The Body contains the EpochEntryList on the Master side.
+
+After the Slave receives the packet sent back by the Master, it will perform
the log truncation process described above locally.
+
+#### Transfer stage
+
+1.AutoSwitchHaConnection (Master) will continually send log packets to the
Slave as follows:
+
+
+
+`current state(4byte) + body size(4byte) + offset(8byte) + epoch(4byte) +
epochStartOffset(8byte) + additionalInfo(confirmOffset) (8byte)+ body`
+
+- `Current state`: represents the current HAConnectionState, which is Transfer.
+- `Body size`: represents the length of the body.
+- `Offset`: the starting offset of the current batch of logs.
+- `Epoch`: represents the MasterEpoch to which the current batch of logs
belongs.
+- `epochStartOffset`: represents the StartOffset of the MasterEpoch
corresponding to the current batch of logs.
+- `confirmOffset`: represents the minimum offset among replicas in
SyncStateSet.
+- `Body`: logs.
+
+2.AutoSwitchHaClient (Slave) will send an ACK packet to the Master:
+
+
+
+` current state(4byte) + maxOffset(8byte)`
+
+- `Current state`: represents the current HAConnectionState, which is Transfer.
+- `MaxOffset`: represents the current maximum log offset of the Slave.
+
+## Elect Master
+
+### Basic process
+
+ELectMaster mainly selects a new Master from the SyncStateSet list when the
Master of a Broker replica group is offline or inaccessible. This event is
initiated by the Controller itself or through the `electMaster` operation
command.
+
+Whether the Controller is deployed independently or embedded in Namesrv, it
listens to the connection channels of each Broker. If a Broker channel becomes
inactive, it checks whether the Broker is the Master, and if so, it triggers
the Master election process.
+
+The process of electing a Master is relatively simple. We just need to select
one from the SyncStateSet list corresponding to the group of Brokers and make
it the new Master, and apply the result to the in-memory metadata through the
DLedger consensus. Finally, the result is notified to the corresponding Broker
replica group.
+
+### SyncStateSet change
+
+SyncStateSet is an important basis for electing a Master. Changes to the
SyncStateSet list are mainly initiated by the Master Broker. The Master
completes the Shrink and Expand of SyncStateSet through a periodic task and
initiates an Alter SyncStateSet request to the election component Controller
during the synchronization process.
+
+#### Shrink
+
+Shrink SyncStateSet refers to the removal of replicas from the SyncStateSet
replica set that are significantly behind the Master, based on the following
criteria:
+
+- Increase the haMaxTimeSlaveNotCatchUp parameter.
+- HaConnection records the last time the Slave caught up with the Master's
timestamp, lastCaughtUpTimeMs, which means: every time the Master sends data
(transferData) to the Slave, it records its current MaxOffset as
lastMasterMaxOffset and the current timestamp lastTransferTimeMs.
+- When ReadSocketService receives slaveAckOffset, if slaveAckOffset >=
lastMasterMaxOffset, it updates lastCaughtUpTimeMs to lastTransferTimeMs.
+- The Master scans each HaConnection through a periodic task and if (cur_time
- connection.lastCaughtUpTimeMs) > haMaxTimeSlaveNotCatchUp, the Slave is
Out-of-sync.
+- If a Slave is detected to be out of sync, the master immediately reports
SyncStateSet to the Controller, thereby shrinking SyncStateSet.
+
+#### Expand
+
+If a Slave replica catches up with the Master, the Master needs to timely
alter SyncStateSet with the Controller. The condition for adding to
SyncStateSet is slaveAckOffset >= ConfirmOffset (the minimum value of MaxOffset
among all replicas in the current SyncStateSet).
+
+## Reference
+
+[RIP-44](https://github.com/apache/rocketmq/wiki/RIP-44-Support-DLedger-Controller)
diff --git a/docs/en/image/controller/controller_design_1.png
b/docs/en/image/controller/controller_design_1.png
new file mode 100644
index 000000000..fea825641
Binary files /dev/null and b/docs/en/image/controller/controller_design_1.png
differ
diff --git a/docs/en/image/controller/controller_design_3.png
b/docs/en/image/controller/controller_design_3.png
new file mode 100644
index 000000000..8c475bcec
Binary files /dev/null and b/docs/en/image/controller/controller_design_3.png
differ
diff --git a/docs/en/image/controller/controller_design_4.png
b/docs/en/image/controller/controller_design_4.png
new file mode 100644
index 000000000..308b93627
Binary files /dev/null and b/docs/en/image/controller/controller_design_4.png
differ
diff --git a/docs/en/image/controller/controller_design_5.png
b/docs/en/image/controller/controller_design_5.png
new file mode 100644
index 000000000..01b33cab2
Binary files /dev/null and b/docs/en/image/controller/controller_design_5.png
differ
diff --git a/docs/en/image/controller/controller_design_6.png
b/docs/en/image/controller/controller_design_6.png
new file mode 100644
index 000000000..a909a7037
Binary files /dev/null and b/docs/en/image/controller/controller_design_6.png
differ
diff --git a/docs/en/image/controller/quick-start/controller.png
b/docs/en/image/controller/quick-start/controller.png
new file mode 100644
index 000000000..d7ffed6b8
Binary files /dev/null and
b/docs/en/image/controller/quick-start/controller.png differ