vladimirivic commented on issue #953: ZOOKEEPER-3398 Learner.connectToLeader() may take too long to time-out URL: https://github.com/apache/zookeeper/pull/953#issuecomment-494582370 > Thanks for you contribution @vladimirivic . > > I don't fully understand what's the reasoning behind this change. It looks like you're replacing `initLimit` with a new config value which doesn't make sense to me. Why not setting `initLimit` to a lower value? That's a great question. Let me try to elaborate why this config option. In the case of Learner.connectToLeader(), initLimit is used to timeout solely the socket connect operation. Because this method retries up to 5 times, it may leave an ensemble without quorum for too long. Let's say I was using initLimit=20. In case of network issues, this value would leave ensemble without a quorum for almost 2 minutes. The problem can be remediated with use of smaller initLimit value. And I can set it to initLimit=10, or even initLimit=5 so when there are network issues and if we wait for all 5 reties we don't spend to much time without having the quorum. That would solve the problem for the socket but then it may cause issues inside Leader.lead() because initLimit is used by these three methods after the leader gets elected and starts leading: - Leader.getEpochToPropose() - Leader.waitForEpochAck() - Leader.waitForNewLeaderAck() If we go with initLimit smaller than what we tolerate, then that will increase the number of the Exceptions when using ensembles that are not in the same region which will cause quorum issues. Particularly these three from Leader.java: - Timeout while waiting for NEWLEADER to be acked by quorum - Timeout while waiting for epoch from quorum - Timeout while waiting for epoch to be acked by quorum I think it would be good to have a separate config to control socket connection timeout without affecting the quorum timeout so they don't hinder each other. Hopefully this helps shed some light with this pull request. Also, if this sounds like a good idea, maybe we should take into consideration other socket connect calls and use unified config name to address this. Let me know what you think. You can also touch base with @lvfangmin for more specific details about the problems we have seen in production.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
