empiredan commented on code in PR #108:
URL: https://github.com/apache/incubator-pegasus-website/pull/108#discussion_r2048188855

##########
_docs/en/administration/replica-recovery.md:
##########
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery
 Generally speaking, data in Pegasus is stored with 3 replicas. For each partition, under normal situation, there should be one primary replica and two secondary replicas, totaling three replicas providing service.
-However, it is inevitable that the cluster will experience node crashes, network anomalies, heartbeat disconnections, and other situations that can cause replica loss, affecting the availability of services. The degree of replica loss affects the ability to read and write (introduced in [Load Balancing](rebalance#conceptual) as well):
+However, node failures,network issues, and heartbeat loss are inevitable in a cluster, leading to replica loss and affecting service availability. Pegasus has three detection mechanisms to identify replica loss:

Review Comment:
```suggestion
However, node failures, network issues, and heartbeat loss are inevitable in a cluster, leading to replica loss and affecting service availability. Pegasus has three detection mechanisms to identify replica loss:
```

##########
_docs/en/administration/replica-recovery.md:
##########
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery
 Generally speaking, data in Pegasus is stored with 3 replicas. For each partition, under normal situation, there should be one primary replica and two secondary replicas, totaling three replicas providing service.
-However, it is inevitable that the cluster will experience node crashes, network anomalies, heartbeat disconnections, and other situations that can cause replica loss, affecting the availability of services. The degree of replica loss affects the ability to read and write (introduced in [Load Balancing](rebalance#conceptual) as well):
+However, node failures,network issues, and heartbeat loss are inevitable in a cluster, leading to replica loss and affecting service availability. Pegasus has three detection mechanisms to identify replica loss:
+
+* 2PC timeout: Mainly ensures the health of the primary-secondary replica relationship. This is a replica-level failure detection, triggered each time a write enters the 2PC process.
+
+* failure_detect: Uses a lease mechanism to ensure the connectivity between the meta server and replica server. This is a server-level failure detection mechanism that can quickly identify a node's availability issue. The default interval in production is 3 seconds.
+
+* group_check: A task initiated when a replica becomes the primary. It periodically sends RPCs to secondaries to check their liveness. The default interval in production is 100 seconds.
+
+Among them, 2PC timeout and group_check help the primary detect connection issues with its secondaries and remove faulty replicas from the topology, reporting them to meta. failure_detect helps the meta server identify faulty replica nodes and remove all their replicas from the topology.

Review Comment:
```suggestion
Among them, 2PC timeout and group_check help the primary detect connection issues with its secondaries and remove faulty replicas from the topology, reporting them to meta server. failure_detect helps the meta server identify faulty replica nodes and remove all their replicas from the topology.
```

##########
_docs/en/administration/replica-recovery.md:
##########
@@ -6,7 +6,17 @@ permalink: administration/replica-recovery
 Generally speaking, data in Pegasus is stored with 3 replicas. For each partition, under normal situation, there should be one primary replica and two secondary replicas, totaling three replicas providing service.
-However, it is inevitable that the cluster will experience node crashes, network anomalies, heartbeat disconnections, and other situations that can cause replica loss, affecting the availability of services. The degree of replica loss affects the ability to read and write (introduced in [Load Balancing](rebalance#conceptual) as well):
+However, node failures,network issues, and heartbeat loss are inevitable in a cluster, leading to replica loss and affecting service availability. Pegasus has three detection mechanisms to identify replica loss:
+
+* 2PC timeout: Mainly ensures the health of the primary-secondary replica relationship. This is a replica-level failure detection, triggered each time a write enters the 2PC process.

Review Comment:
```suggestion
* 2PC timeout: Mainly ensures the health of the primary-secondary replica relationship. This is a replica-level failure detection, triggered each time a write enters the 2PC phase.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
