This is an automated email from the ASF dual-hosted git repository.

wangdan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pegasus-website.git

The following commit(s) were added to refs/heads/master by this push:
     new 004c5fa9 Update replica recovery of English version (#108)
004c5fa9 is described below

commit 004c5fa9e3a41a17aebcb332f60dabe1cced24bc
Author: ninsmiracle <110282526+ninsmira...@users.noreply.github.com>
AuthorDate: Thu Apr 17 12:05:54 2025 +0800

    Update replica recovery of English version (#108)
---
 _docs/en/administration/replica-recovery.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/_docs/en/administration/replica-recovery.md b/_docs/en/administration/replica-recovery.md
index 05e8db4f..9db88a9c 100644
--- a/_docs/en/administration/replica-recovery.md
+++ b/_docs/en/administration/replica-recovery.md
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery

 Generally speaking, data in Pegasus is stored with 3 replicas. For each partition, under normal situation, there should be one primary replica and two secondary replicas, totaling three replicas providing service.

-However, it is inevitable that the cluster will experience node crashes, network anomalies, heartbeat disconnections, and other situations that can cause replica loss, affecting the availability of services. The degree of replica loss affects the ability to read and write (introduced in [Load Balancing](rebalance#conceptual) as well):
+However, node failures, network issues, and heartbeat loss are inevitable in a cluster, leading to replica loss and affecting service availability. Pegasus has three detection mechanisms to identify replica loss:
+
+* 2PC timeout: Mainly ensures the health of the primary-secondary replica relationship. This is a replica-level failure detection, triggered each time a write enters the 2PC phase.
+
+* failure_detect: Uses a lease mechanism to ensure the connectivity between the meta server and replica server. This is a server-level failure detection mechanism that can quickly identify a node's availability issue. The default interval in production is 3 seconds.
+
+* group_check: A task initiated when a replica becomes the primary. It periodically sends RPCs to secondaries to check their liveness. The default interval in production is 100 seconds.
+
+Among them, 2PC timeout and group_check help the primary detect connection issues with its secondaries and remove faulty replicas from the topology, reporting them to meta server. failure_detect helps the meta server identify faulty replica nodes and remove all their replicas from the topology.
+
+Through these three detection mechanisms, the meta server detects lost replicas and triggers the subsequent cure process to restore all replicas to a healthy state. The degree of replica loss affects the ability to read and write (introduced in [Load Balancing](rebalance#conceptual) as well):

 * One primary and two replicas are available: The partition is completely healthy and can **read and write normally**.
 * One primary and one replica are available: According to the PacificA consistency protocol, it can still **read and write safely**.
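
The lease mechanism that the diff describes for failure_detect can be illustrated with a minimal Python sketch. This is not Pegasus code: the class `LeaseTable`, its methods, the node names, and the 9-second lease length are hypothetical and chosen only for illustration. Treating the quoted 3-second interval as the heartbeat period is also an assumption.

```python
from dataclasses import dataclass, field

BEACON_INTERVAL_S = 3.0  # interval quoted in the doc; using it as the heartbeat period is an assumption
LEASE_S = 9.0            # hypothetical lease length, not a Pegasus default


@dataclass
class LeaseTable:
    """Meta-server-side view: a node counts as alive while its lease is unexpired."""
    lease_s: float = LEASE_S
    expiry: dict = field(default_factory=dict)  # node name -> lease expiry time

    def on_beacon(self, node: str, now: float) -> None:
        # Each heartbeat renews the sender's lease.
        self.expiry[node] = now + self.lease_s

    def expired_nodes(self, now: float) -> list:
        # Nodes whose lease has run out; their replicas would leave the topology.
        return sorted(n for n, t in self.expiry.items() if t <= now)


table = LeaseTable()
table.on_beacon("replica-a", now=0.0)
table.on_beacon("replica-b", now=0.0)
table.on_beacon("replica-a", now=BEACON_INTERVAL_S)  # replica-b misses this heartbeat
lost = table.expired_nodes(now=10.0)
print(lost)
```

Passing `now` explicitly instead of reading a clock keeps the sketch deterministic. A real detector would also have to account for clock skew between the two sides of the lease.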