Re: [PR] add en replica recovery [incubator-pegasus-website]

via GitHub Wed, 16 Apr 2025 21:22:22 -0700


empiredan commented on code in PR #108:
URL: 
https://github.com/apache/incubator-pegasus-website/pull/108#discussion_r2048188855



##########
_docs/en/administration/replica-recovery.md:
##########
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery
 
 Generally speaking, data in Pegasus is stored with 3 replicas. For each 
partition, under normal situation, there should be one primary replica and two 
secondary replicas, totaling three replicas providing service.
 
-However, it is inevitable that the cluster will experience node crashes, 
network anomalies, heartbeat disconnections, and other situations that can 
cause replica loss, affecting the availability of services. The degree of 
replica loss affects the ability to read and write (introduced in [Load 
Balancing](rebalance#conceptual) as well):
+However, node failures,network issues, and heartbeat loss are inevitable in a 
cluster, leading to replica loss and affecting service availability. Pegasus 
has three detection mechanisms to identify replica loss:

Review Comment:
   ```suggestion
   However, node failures, network issues, and heartbeat loss are inevitable in 
a cluster, leading to replica loss and affecting service availability. Pegasus 
has three detection mechanisms to identify replica loss:
   ```



##########
_docs/en/administration/replica-recovery.md:
##########
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery
 
 Generally speaking, data in Pegasus is stored with 3 replicas. For each 
partition, under normal situation, there should be one primary replica and two 
secondary replicas, totaling three replicas providing service.
 
-However, it is inevitable that the cluster will experience node crashes, 
network anomalies, heartbeat disconnections, and other situations that can 
cause replica loss, affecting the availability of services. The degree of 
replica loss affects the ability to read and write (introduced in [Load 
Balancing](rebalance#conceptual) as well):
+However, node failures,network issues, and heartbeat loss are inevitable in a 
cluster, leading to replica loss and affecting service availability. Pegasus 
has three detection mechanisms to identify replica loss:
+
+* 2PC timeout: Mainly ensures the health of the primary-secondary replica 
relationship. This is a replica-level failure detection, triggered each time a 
write enters the 2PC process.
+
+* failure_detect: Uses a lease mechanism to ensure the connectivity between 
the meta server and replica server. This is a server-level failure detection 
mechanism that can quickly identify a node's availability issue. The default 
interval in production is 3 seconds.
+
+* group_check: A task initiated when a replica becomes the primary. It 
periodically sends RPCs to secondaries to check their liveness. The default 
interval in production is 100 seconds.
+
+Among them, 2PC timeout and group_check help the primary detect connection 
issues with its secondaries and remove faulty replicas from the topology, 
reporting them to meta. failure_detect helps the meta server identify faulty 
replica nodes and remove all their replicas from the topology.

Review Comment:
   ```suggestion
   Among them, 2PC timeout and group_check help the primary detect connection 
issues with its secondaries and remove faulty replicas from the topology, 
reporting them to meta server. failure_detect helps the meta server identify 
faulty replica nodes and remove all their replicas from the topology.
   ```
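
For readers unfamiliar with the lease idea that the failure_detect bullet mentions, here is a minimal sketch. It is an illustration only and not Pegasus code: the class name, method names, node names, and lease length are all hypothetical, and the real implementation may differ substantially. It only shows the general pattern in which a beacon renews a lease and a node is treated as faulty once its lease expires.

```python
import time
from typing import Callable, Dict, List


class LeaseFailureDetector:
    """Toy lease-based detector (hypothetical, not the Pegasus implementation).

    A node counts as alive only while its most recent lease is unexpired.
    """

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_seconds = lease_seconds  # assumed lease length, arbitrary here
        self.clock = clock                  # injectable clock so tests need no sleeping
        self.expiry: Dict[str, float] = {}  # node name -> time its lease expires

    def on_beacon(self, node: str) -> None:
        # Each beacon (heartbeat) from a node renews that node's lease.
        self.expiry[node] = self.clock() + self.lease_seconds

    def expired_nodes(self) -> List[str]:
        # Nodes whose lease has run out; a meta-server-like component would
        # then remove every replica hosted on these nodes from the topology.
        now = self.clock()
        return sorted(n for n, t in self.expiry.items() if t <= now)
```

Injecting the clock keeps the sketch deterministic: a caller can advance a fake clock past the lease length and check which nodes `expired_nodes()` reports.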



##########
_docs/en/administration/replica-recovery.md:
##########
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery
 
 Generally speaking, data in Pegasus is stored with 3 replicas. For each 
partition, under normal situation, there should be one primary replica and two 
secondary replicas, totaling three replicas providing service.
 
-However, it is inevitable that the cluster will experience node crashes, 
network anomalies, heartbeat disconnections, and other situations that can 
cause replica loss, affecting the availability of services. The degree of 
replica loss affects the ability to read and write (introduced in [Load 
Balancing](rebalance#conceptual) as well):
+However, node failures,network issues, and heartbeat loss are inevitable in a 
cluster, leading to replica loss and affecting service availability. Pegasus 
has three detection mechanisms to identify replica loss:
+
+* 2PC timeout: Mainly ensures the health of the primary-secondary replica 
relationship. This is a replica-level failure detection, triggered each time a 
write enters the 2PC process.

Review Comment:
   ```suggestion
   * 2PC timeout: Mainly ensures the health of the primary-secondary replica 
relationship. This is a replica-level failure detection, triggered each time a 
write enters the 2PC phase.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


