[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Huang updated ZOOKEEPER-3816:
---------------------------------
    Labels: pull-request-available  (was: )

> Improve the lagging detection between the leader and learners 
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3816
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3816
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Jie Huang
>            Assignee: Jie Huang
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.6.2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we have SyncLimitCheck on the leader to detect a lagging learner by 
> tracking the time it takes for a proposal to be acknowledged. If the leader 
> doesn't receive the ack for a proposal from a learner within the syncLimit, it 
> disconnects the learner. 
> The purpose of the SyncLimitCheck is to prevent sessions connected to a slow 
> learner from being expired.  By disconnecting the slow learner, it gives the 
> clients a chance to re-connect to another server before session expiration. 
> However, there are two cases in which sessions can still expire with the 
> current SyncLimitCheck implementation. 
> One case is that the ack reaches the leader on time but a ping response 
> including the session table is delayed. The lagging detection is based on the 
> proposal/ack time yet the sessions are updated when the ping response is 
> received. If the ping response is delayed longer than the ack, the sessions 
> could expire without lagging being detected. It makes more sense to detect 
> lagging based on ping/ping response time. 
> Another case is that the leader detects lagging and closes the connection to 
> the slow learner, but the learner doesn't know that it has been disconnected 
> because of a long socket closing time or a lost RST signal. So the learner 
> doesn't disconnect its clients, which lose their chance to re-connect to 
> another server before session expiration. The learner, like the leader, also 
> needs a means to detect communication issues at a higher-than-socket layer.
> So we need a lagging detector that is based on ping/ping response and works 
> bi-directionally between the leader and the learners. 
>  
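
As a rough illustration of the ping-based detection described above, here is a
minimal sketch in Java. The class PingLagDetector, its method names, and the
limitMillis parameter are all hypothetical and are not taken from the ZooKeeper
code base or from the pull request. The sketch assumes ping responses arrive in
the order the pings were sent and that the caller supplies a monotonic clock
value such as System.nanoTime().

```java
import java.util.ArrayDeque;

/** Hypothetical sketch: flags a peer as lagging when a ping goes unanswered too long. */
final class PingLagDetector {
    private final long limitNanos;
    // Send times of pings still awaiting a response, oldest first.
    private final ArrayDeque<Long> pending = new ArrayDeque<>();

    PingLagDetector(long limitMillis) {
        this.limitNanos = limitMillis * 1_000_000L;
    }

    /** Record that a ping was sent at the given monotonic time. */
    synchronized void onPingSent(long nowNanos) {
        pending.addLast(nowNanos);
    }

    /** Record a ping response; assumes responses arrive in send order. */
    synchronized void onPingResponse() {
        pending.pollFirst();
    }

    /** True if the oldest unanswered ping has waited longer than the limit. */
    synchronized boolean isLagging(long nowNanos) {
        Long oldest = pending.peekFirst();
        return oldest != null && nowNanos - oldest > limitNanos;
    }
}
```

On the leader, one such detector per learner could be checked on every tick and
the learner disconnected once isLagging returns true. For the bi-directional
part, the learner would need its own signal to time, for example the interval
since the last ping it received from the leader; that choice is an assumption
here, since the description above does not specify it.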



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
