This is an automated email from the ASF dual-hosted git repository.

liuhan pushed a commit to branch proeprty-backgroud-repair-process
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
commit 185d0067a3a187cd85ab20a2a2dde8b25718a4dd
Author: mrproliu <741550...@qq.com>
AuthorDate: Fri Jul 11 20:13:48 2025 +0800

    Add the property background repair strategy documentation
---
 docs/concept/property-repair.md | 142 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)

diff --git a/docs/concept/property-repair.md b/docs/concept/property-repair.md
new file mode 100644
index 00000000..27fe6b1e
--- /dev/null
+++ b/docs/concept/property-repair.md
@@ -0,0 +1,142 @@

# Property Background Repair Strategy

This document introduces how `property` type data is handled in a cluster deployment, and
how the system automatically verifies cross-machine, multi-replica data consistency in the background and
performs data recovery operations.

The process can be roughly divided into the following two steps:
1. **Build Merkle Tree**: Automatically build anti-entropy Merkle Trees in the background for efficient comparison in subsequent operations.
2. **Recover Data through Gossip**: Use the gossip protocol among data nodes to detect and repair data inconsistencies.

## Build Merkle Tree

The Merkle Tree is a data structure used to efficiently compare node information across two different data nodes, allowing
the data nodes to quickly determine whether their data is consistent.

### Structure

In the context of Property data, each shard within a group builds its own Merkle Tree, since the Property data set is finite.
The tree consists of the following three types of nodes:
1. **Leaf Node**: Stores the summary information of each Property entry:
   1. **Entity**: The identifier of the Property data, composed of `group_name` + `property_name` + `id`.
   2. **SHA Value**: The SHA512 hash value of the Property source data, used for fast equality comparison to check the consistency of the same entity data.
2. **Slot Node**: Each tree contains a fixed number of slot nodes. When a Property entry is added to the tree, it is placed in a slot node based on the hash value of its Entity.
   The slot node contains the SHA value of the Property data it holds and the number of Property entries in that slot.
3. **Root Node**: The root node of the Merkle Tree, which contains the SHA value of the entire tree and the number of slot nodes.

Therefore, the Merkle Tree has three levels:
* The top level is a single root node.
* The middle level is a fixed number of slot nodes.
* The bottom level is a variable number of leaf nodes.

### Timing and Building Process

There are two types of triggers for Merkle Tree construction:
1. **Fixed Interval**: By default, the system automatically triggers the construction every hour (this interval is configurable).
2. **On Update**: When an update in the shard is detected, the system schedules a delayed build after a short wait period (default 10 minutes).

The construction process follows these steps (a simplified sketch of the tree layout and hashing appears after this list):
1. **Check for Updates**: The system compares the snapshot ID (`XXXXX.snp`) of the previously built tree with the current snapshot ID of the shard.
   If they differ, it indicates that data has changed, and the process continues. If they match, the tree construction is skipped.
2. **Snapshot the Shard**: A snapshot of the shard data is taken to avoid blocking ongoing business operations during data traversal.
3. **Build the Tree**: Using a streaming method, the system scans all data in the snapshot and builds a Merkle Tree for each group individually.
4. **Save the Snapshot ID**: The snapshot ID used in this build is saved, so it can be used for efficient change detection during the next scheduled run.
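The following Go sketch illustrates the three-level layout and the slot-assignment idea described above. It is a simplified illustration rather than the actual BanyanDB implementation: the type and function names (`leaf`, `tree`, `slotSHA`, `rootSHA`), the FNV-based slot selection, and the example entity strings are all assumptions made for clarity.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"sort"
)

// leaf summarizes one Property entry: its entity identifier and the
// SHA512 of its source data.
type leaf struct {
	entity string // group_name + property_name + id
	sha    string // SHA512 of the Property source data
}

// tree is a fixed-slot Merkle tree: entities are bucketed into slots,
// each slot hash covers its leaves, and the root hash covers all slots.
type tree struct {
	slots [][]leaf
}

func newTree(slotCount int) *tree { return &tree{slots: make([][]leaf, slotCount)} }

// add places a Property entry into a slot chosen by hashing its entity.
func (t *tree) add(entity string, sourceData []byte) {
	h := fnv.New32a()
	h.Write([]byte(entity))
	idx := int(h.Sum32() % uint32(len(t.slots)))
	sum := sha512.Sum512(sourceData)
	t.slots[idx] = append(t.slots[idx], leaf{entity: entity, sha: hex.EncodeToString(sum[:])})
}

// slotSHA aggregates the leaves of one slot into a single hash.
func (t *tree) slotSHA(idx int) string {
	leaves := append([]leaf(nil), t.slots[idx]...)
	sort.Slice(leaves, func(i, j int) bool { return leaves[i].entity < leaves[j].entity })
	h := sha512.New()
	for _, l := range leaves {
		h.Write([]byte(l.entity))
		h.Write([]byte(l.sha))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// rootSHA aggregates all slot hashes into the root hash.
func (t *tree) rootSHA() string {
	h := sha512.New()
	for i := range t.slots {
		h.Write([]byte(t.slotSHA(i)))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	mt := newTree(32) // hypothetical fixed slot count
	mt.add("default/service_metadata/svc-1", []byte(`{"name":"svc-1"}`))
	mt.add("default/service_metadata/svc-2", []byte(`{"name":"svc-2"}`))
	fmt.Println("root:", mt.rootSHA()[:16], "...")
}
```

Because the slot count is fixed, two nodes can compare a shard by exchanging only the root SHA and the slot SHA values, and descend to leaf details only for slots that differ.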
## Gossip Protocol

In BanyanDB's typical data communication model, data exchange is primarily handled by **liaison** nodes, which interact with **data nodes** through **broadcasting or publishing** mechanisms.
However, in the context of Property data repair, involving liaison nodes would introduce unnecessary network overhead.
Therefore, the system adopts the gossip protocol, allowing data nodes to communicate directly with each other and perform data repair in a more efficient and decentralized manner.

Unlike typical gossip-based message dissemination, where messages are spread randomly, the number of **data nodes** in BanyanDB is fixed and relatively small.
To ensure greater stability, random peer selection is replaced with a deterministic strategy, where each node communicates only with the next node in the sequence.
Additionally, to minimize network traffic, only one peer node is involved in each round of communication.

BanyanDB already has a built-in [cluster discovery and data transmission](./clustering.md) mechanism. Therefore, the gossip protocol can be implemented as an extension on top of the existing cluster protocol.
Since each node already exposes a gRPC port, there is no need to introduce any additional ports, simplifying deployment and integration.

### Propagation Message

When initiating gossip message propagation, the sender node must include both the list of participating nodes and the message content.
The gossip protocol then proceeds through the following steps:

1. **Build the Context**: A context object is attached to each message and includes the following parameters (see the sketch after this list):
   1. **Node List**: A list of participating nodes used to determine the next node in the sequence.
   2. **Maximum Count**: The maximum number of message transmissions, calculated as `(node_count) * 2 - 3`. The rationale behind this formula is explained in the example below.
   3. **Origin Node ID**: The ID of the node that initiated the gossip message propagation, used to return the final result.
   4. **Original Message ID**: A unique identifier for the original message, allowing the origin node to track when a gossip propagation process has completed.
2. **Send to the First Node**: If the first node in the list is the sender itself, the process begins locally. Otherwise, the message is sent to the first node in the list.
3. **Receive Gossip Message**: Upon receiving the message, the node identifies the next node in the sequence and prepares for peer-to-peer interaction.
4. **Protocol-Level Handling**: The Property repair logic runs its own protocol between the current node and the next node, ensuring their Property data is synchronized. (Details of the two-node sync protocol are covered in a later section.)
5. **Handle Result**: If the sync is successful, the process continues. If it fails, the flow jumps to step 7.
6. **Forward to the Next Node**: The maximum count in the context is decremented by one to indicate a completed round.
   * If the maximum count reaches zero, the process is considered complete, and it proceeds to step 7.
   * Otherwise, the message is forwarded to the next node, repeating steps 3–6.
7. **Send Result to Origin Node**: The current node sends the result (success or failure) to the origin node specified within the context.
8. **Origin Node Receives the Result**: The initiating node receives the final outcome of the gossip propagation and can proceed with any post-processing logic.
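The Go sketch below shows, under assumptions, what the propagation context might carry and how each forwarding round decrements the remaining count. The names (`propagationContext`, `nextNode`, `completeRound`) are illustrative, not the actual BanyanDB types; only the fields named in step 1 are modeled.

```go
package main

import "fmt"

// propagationContext travels with every gossip message (illustrative only).
type propagationContext struct {
	Nodes        []string // participating data nodes, in a fixed order
	MaxCount     int      // remaining transmissions: node_count*2 - 3 at the start
	OriginNodeID string   // node that initiated the propagation
	MessageID    string   // lets the origin correlate the final result
}

// maxRounds implements the (node_count) * 2 - 3 formula from the context.
func maxRounds(nodeCount int) int { return nodeCount*2 - 3 }

// nextNode returns the peer the given node should talk to: its successor
// in the fixed node list, wrapping around at the end.
func nextNode(ctx *propagationContext, current string) string {
	for i, n := range ctx.Nodes {
		if n == current {
			return ctx.Nodes[(i+1)%len(ctx.Nodes)]
		}
	}
	return ""
}

// completeRound decrements the remaining count after a successful
// two-node sync and reports whether propagation should stop.
func completeRound(ctx *propagationContext) bool {
	ctx.MaxCount--
	return ctx.MaxCount <= 0
}

func main() {
	ctx := &propagationContext{
		Nodes:        []string{"A", "B", "C"},
		MaxCount:     maxRounds(3), // 3*2 - 3 = 3 rounds
		OriginNodeID: "B",
		MessageID:    "msg-001",
	}
	current := ctx.Nodes[0] // origin B sends the message to the first node, A
	for {
		peer := nextNode(ctx, current)
		fmt.Printf("%s syncs with %s\n", current, peer)
		if completeRound(ctx) {
			fmt.Println("propagation complete, report result to", ctx.OriginNodeID)
			break
		}
		current = peer // forward the message to the peer for the next round
	}
}
```

Running this with three nodes produces exactly three sync rounds (A with B, B with C, C with A), which is why the maximum count is `node_count * 2 - 3`: it lets every pair of adjacent nodes synchronize once while the message travels around the fixed sequence.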
### Example of Message Propagation

To illustrate the gossip message propagation process, consider a scenario with three nodes and their data versions: A (version 2), B (version 1), C (version 3), where the sender is node B. The sequence of operations is as follows:

1. Node B builds the context, and the maximum count is calculated as `(3) * 2 - 3 = 3`.
2. Node B sends the gossip message to A, the first node in the node list.
3. Node A receives the message, identifies B as the next node, and performs the peer-to-peer interaction. The nodes and their versions are now: A (version 2), B (version 2), C (version 3).
4. Node A forwards the gossip message to B. Node B receives it, identifies C as the next node, and performs the peer-to-peer interaction. The nodes and their versions are now: A (version 2), B (version 3), C (version 3).
5. Node B forwards the gossip message to C. Node C receives it, identifies A as the next node, and performs the peer-to-peer interaction. The nodes and their versions are now: A (version 3), B (version 3), C (version 3).
6. Finally, the maximum count is decremented to zero, meaning the gossip propagation is complete, and node C sends the result to the origin node B.
7. Node B receives the result, and the gossip propagation is completed.

### Tracing

Tracing becomes critically important in a gossip-based protocol, as issues can arise at any stage of the communication and synchronization process.
New trace spans are created in the following scenarios:

1. **Initiating Node**: The initiating node records the full trace of the entire request, capturing all steps from message dispatch to the final result collection.
2. **Receiving Sync Requests**: When a node receives a sync request and begins communication with another peer node, it creates a new trace span for that interaction to track the request lifecycle independently.
3. **Business Logic Execution**: During the actual business processing (e.g., data comparison, update, and transmission), custom trace data and metadata can be attached to the active trace context to provide deeper insights into logic-specific behaviors.

After each synchronization cycle, the receiving node sends its trace data back to the initiating node, allowing the initiator to aggregate and correlate all spans for end-to-end visibility and debugging.

## Property Repair

Once all the above preparations are complete, the system can proceed with the Property Repair process.
This process is scheduled to run on each data node daily at 1:00 AM, and follows these steps (a sketch of the group selection appears after this list):
1. **Select a Group**: The node retrieves the list of Property groups whose number of **replicas is greater than or equal to 2**, and randomly selects one group for repair.
2. **Query Node List**: The node then determines the list of nodes that hold replicas of the selected group and sends the gossip propagation message to those nodes to synchronize the Property data of that group.
3. **Wait for the Result**: The initiating node waits for the final result of the synchronization process before proceeding.
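The following Go sketch illustrates step 1 of the daily run: filtering out groups with fewer than two replicas (which have nothing to reconcile) and picking one of the remaining groups at random. The `group` type, `selectGroupForRepair` function, and the sample group names are hypothetical and only stand in for the real metadata lookup.

```go
package main

import (
	"fmt"
	"math/rand"
)

// group is an illustrative stand-in for a Property group's metadata;
// only the fields needed for repair scheduling are shown.
type group struct {
	Name     string
	Replicas int
}

// selectGroupForRepair keeps the groups whose replica count is at least 2
// and picks one at random, mirroring step 1 of the daily repair run.
func selectGroupForRepair(groups []group) (group, bool) {
	candidates := make([]group, 0, len(groups))
	for _, g := range groups {
		if g.Replicas >= 2 {
			candidates = append(candidates, g)
		}
	}
	if len(candidates) == 0 {
		return group{}, false
	}
	return candidates[rand.Intn(len(candidates))], true
}

func main() {
	groups := []group{
		{Name: "ui_template", Replicas: 1},      // skipped: only one replica
		{Name: "service_metadata", Replicas: 3}, // eligible for repair
	}
	if g, ok := selectGroupForRepair(groups); ok {
		fmt.Println("repairing group:", g.Name)
	} else {
		fmt.Println("no group has 2 or more replicas; nothing to repair")
	}
}
```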
### Property Synchronization between Two Nodes

When two nodes engage in Property data synchronization, they follow a specific protocol to ensure data consistency.
Let's refer to the current node as A and the target node as B. The process is as follows:

1. **(A) Establish Connection**: Node A initiates a **bidirectional streaming connection** with node B to enable real-time, two-way data transfer.
2. **(A) Iterate Over Shards**: Node A retrieves the list of all Property-related shards for the selected group and processes them one by one.
3. **(A) Send Merkle Tree Summary**: For each shard, node A reads its Merkle Tree and sends a summary (including the root SHA and the slot SHA values) to node B.
   This allows B to quickly identify which slots may contain differences.
4. **(B) Verify Merkle Tree Summary and Respond**: Node B compares the received summary against its own Merkle Tree for the same shard and group:
   * If the root SHA values match, node B returns an empty slot list, indicating no differences.
   * If the root SHA values differ, node B checks the slot SHA values, identifies mismatched slots, and sends back all relevant leaf node details, including the **slot index**, **entity**, and **SHA value**.
5. **(A) Compare Leaf Data**: Node A processes the received leaf data and takes the following actions (a sketch of this classification appears at the end of this document):
   * For missing entities (present on B but not on A), A requests the full Property data from B.
   * For entities present on A but not on B, A sends the full Property data to B.
   * For SHA mismatches, A sends its full Property data to B for validation.
6. **(B) Validate Actual Data**: Node B handles the data as follows:
   * For missing entities, B returns the latest version of the data.
   * For entities present on A but not on B, B updates its local copy with the data from A.
   * For SHA mismatches, B compares the version numbers. If B's version is newer, it returns the Property data to A. If A's version is newer, B updates its local copy and does not return any data.
7. **(A) Update Local Data**: Node A receives updated Property data from B and applies the changes to its local store.

This concludes the A-to-B synchronization cycle for a given group in the Property repair process.

### Error Handling

During the synchronization process, the system will terminate or skip processing under the following scenarios:

1. **Merkle Tree Not Built**: If either node A or node B has not yet built the Merkle Tree for the target group, the gossip protocol is immediately terminated.
2. **Duplicate Sync Request**: If a new gossip sync request for the same group is received by either node while an existing synchronization is in progress, the new request is terminated to avoid conflicts.
3. **Target Node Request Failure**: If node B fails to send or respond to requests during the sync process, the gossip protocol is terminated.
4. **Property Repair Failure**: If an error occurs while applying Property data updates (e.g., a write or query failure), the affected entity is added to a `repair failure list`.
   This list is included in subsequent gossip message propagation to indicate that the entity should be skipped in future repair attempts.
5. **Unhandled Exceptions**: For any other unexpected exceptions or failures, the gossip protocol is immediately terminated to maintain system consistency and avoid cascading errors.
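To make step 5 of the two-node protocol more concrete, the Go sketch below classifies the leaf details returned for mismatched slots into the three cases described there. It is a simplified illustration under assumed names (`leafSummary`, `diffResult`, `classifyLeaves`); the actual BanyanDB exchange also carries the slot index and streams the full Property data afterwards.

```go
package main

import "fmt"

// leafSummary is the per-entity detail exchanged for a mismatched slot:
// the entity identifier and the SHA512 of its Property data.
type leafSummary struct {
	Entity string
	SHA    string
}

// diffResult groups entities by the action node A has to take (step 5).
type diffResult struct {
	MissingOnA []string // present on B only: A requests the full data from B
	MissingOnB []string // present on A only: A sends its full data to B
	Mismatched []string // present on both with different SHA: A sends its data for validation
}

// classifyLeaves compares A's and B's leaf summaries for one slot.
func classifyLeaves(aLeaves, bLeaves []leafSummary) diffResult {
	aByEntity := make(map[string]string, len(aLeaves))
	for _, l := range aLeaves {
		aByEntity[l.Entity] = l.SHA
	}
	var res diffResult
	seen := make(map[string]bool, len(bLeaves))
	for _, l := range bLeaves {
		seen[l.Entity] = true
		sha, ok := aByEntity[l.Entity]
		switch {
		case !ok:
			res.MissingOnA = append(res.MissingOnA, l.Entity)
		case sha != l.SHA:
			res.Mismatched = append(res.Mismatched, l.Entity)
		}
	}
	for _, l := range aLeaves {
		if !seen[l.Entity] {
			res.MissingOnB = append(res.MissingOnB, l.Entity)
		}
	}
	return res
}

func main() {
	a := []leafSummary{{"g/p/1", "sha-1"}, {"g/p/2", "sha-2a"}}
	b := []leafSummary{{"g/p/2", "sha-2b"}, {"g/p/3", "sha-3"}}
	fmt.Printf("%+v\n", classifyLeaves(a, b))
	// Output: {MissingOnA:[g/p/3] MissingOnB:[g/p/1] Mismatched:[g/p/2]}
}
```

Entities in the `Mismatched` bucket are the ones whose conflict is then resolved by the version comparison described in step 6.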