[
https://issues.apache.org/jira/browse/ZOOKEEPER-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jie Huang updated ZOOKEEPER-3816:
---------------------------------
Labels: pull-request-available (was: )
> Improve the lagging detection between the leader and learners
> --------------------------------------------------------------
>
> Key: ZOOKEEPER-3816
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3816
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Reporter: Jie Huang
> Assignee: Jie Huang
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.6.2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, we have SyncLimitCheck on the leader to detect a lagging learner by
> tracking how long each proposal takes to be acknowledged. If the leader doesn't
> receive the ack for a proposal from a learner within the syncLimit, it
> disconnects the learner.
> The purpose of the SyncLimitCheck is to prevent sessions connected to a slow
> learner from being expired. By disconnecting the slow learner, it gives the
> clients a chance to re-connect to another server before session expiration.
> However, there are two cases in which sessions can still expire with the
> current SyncLimitCheck implementation.
> One case is that the ack reaches the leader on time but a ping response
> including the session table is delayed. The lagging detection is based on the
> proposal/ack time yet the sessions are updated when the ping response is
> received. If the ping response is delayed longer than the ack, the sessions
> could expire without lagging being detected. It makes more sense to detect
> lagging based on ping/ping response time.
> Another case is that the leader detects lagging and closes the connection to
> the slow learner but, because of a long socket closing time or a lost RST
> signal, the learner doesn't know that it has been disconnected. So the learner
> doesn't disconnect its clients, who lose their chance to re-connect to another server
> before session expiration. The learner, like the leader, also needs a means
> to detect communication issues at a higher-than-socket layer.
> So we need a lagging detector that is based on ping/ping response and works
> bi-directionally between the leader and the learners.
>
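A minimal sketch of the ping-based detection described above, to make the idea concrete. It is not the ZooKeeper implementation or the attached pull request: the class name PingLagDetector, its methods, the millisecond limit, and the assumption that ping ids increase monotonically are all hypothetical. One side records when each ping goes out and clears it when the matching response arrives; the peer is considered lagging once the oldest unanswered ping is older than the limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a ping/ping-response lag detector.
 * Names and semantics are assumptions, not ZooKeeper APIs.
 */
final class PingLagDetector {
    // Maximum time a ping may stay unanswered, in milliseconds
    // (assumed to play the role of a syncLimit-style timeout).
    private final long limitMs;

    // Outstanding pings in send order, stored as {pingId, sentAtMs}.
    private final Deque<long[]> outstanding = new ArrayDeque<>();

    PingLagDetector(long limitMs) {
        this.limitMs = limitMs;
    }

    /** Record that a ping with the given id was sent at nowMs. */
    synchronized void pingSent(long pingId, long nowMs) {
        outstanding.addLast(new long[] {pingId, nowMs});
    }

    /**
     * Record a response. Assumes ids increase monotonically, so a
     * response to pingId also settles every older outstanding ping.
     */
    synchronized void responseReceived(long pingId) {
        while (!outstanding.isEmpty() && outstanding.peekFirst()[0] <= pingId) {
            outstanding.removeFirst();
        }
    }

    /** True if the oldest unanswered ping has been pending longer than the limit. */
    synchronized boolean isLagging(long nowMs) {
        long[] oldest = outstanding.peekFirst();
        return oldest != null && nowMs - oldest[1] > limitMs;
    }
}
```

Because the check depends only on message timing and not on socket state, the same kind of check could in principle run on the learner as well as on the leader, which is the bi-directional part of the proposal. How the learner would pair its messages is left open in this sketch.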