[
https://issues.apache.org/jira/browse/ZOOKEEPER-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251563#comment-14251563
]
Markus Aalto commented on ZOOKEEPER-1748:
-----------------------------------------
We have been hit by this problem twice; each time our system failed because
leader election did not finish. We have now implemented application-level
keepalive support in version 3.4.6 (for internal use) in such a way that it
supports a rolling upgrade from 3.4.6 to 3.5.x, as 3.4.6 supports 'versioning'
in the initiation message in QuorumCnxManager. The keepalive is implemented in
QuorumCnxManager over the existing TCP connections. We are still testing the
fix and haven't put it into production yet, but will most likely do so in a
few weeks.
Has any work been done on this issue? If not, would it be possible to submit
our proposed fix for 3.5.x once it's ready and properly tested? The plan would
be to make it a configurable option for 3.5.
> TCP keepalive for leader election connections
> ---------------------------------------------
>
> Key: ZOOKEEPER-1748
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1748
> Project: ZooKeeper
> Issue Type: Improvement
> Components: leaderElection
> Affects Versions: 3.4.5, 3.5.0
> Environment: Linux, Java 1.7
> Reporter: Antal Sasvári
> Assignee: Daniel Peon
> Priority: Minor
> Fix For: 3.5.1
>
>
> In our system we encountered the following problem:
> If the system is stable and there is no leader election, the leader election
> port connections stay open for a very long time without any packets being
> sent on them.
> Some network elements silently drop the established TCP connection after a
> timeout if there are no packets being sent on it. In this case the ZK servers
> will not notice the connection loss. This causes additional delay later when
> the next leader election is started, as the TCP connections are not alive any
> more.
> We would like to be able to enable TCP keepalive on the leader election
> sockets in order to prevent the connection timeout in some network elements
> due to connection inactivity.
> This could be controlled by adding a new config parameter called tcpKeepAlive
> to the ZooKeeper configuration file. It would only apply to algorithm 3
> (TCP-based fast leader election) and would default to false.
> If tcpKeepAlive is set to true, the TCP keepalive flag should be enabled for
> the leader election sockets in QuorumCnxManager.setSockOpts() by calling
> sock.setKeepAlive(true).
> We have tested this change successfully in our environment.
> Please comment whether you see any problem with this. If not, I am going to
> submit a patch.
> I've been told that Apache ActiveMQ, for example, also has a config option
> for a similar purpose, called transport.keepalive.
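
For illustration, the tcpKeepAlive proposal quoted above could be sketched
roughly as follows. This is not the actual patch: the class KeepAliveSketch,
the helper readTcpKeepAlive() and the use of java.util.Properties are
assumptions made for the example, and setSockOpts() here is only a stand-in
for QuorumCnxManager.setSockOpts() that shows the keepalive part alone.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.util.Properties;

public class KeepAliveSketch {

    // Hypothetical helper: reads the proposed tcpKeepAlive key from the
    // configuration, defaulting to false as the issue description suggests.
    static boolean readTcpKeepAlive(Properties cfg) {
        return Boolean.parseBoolean(cfg.getProperty("tcpKeepAlive", "false"));
    }

    // Stand-in for QuorumCnxManager.setSockOpts(); only the keepalive flag
    // is shown, any other socket options are omitted.
    static void setSockOpts(Socket sock, boolean tcpKeepAlive) throws SocketException {
        sock.setKeepAlive(tcpKeepAlive);
    }

    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        cfg.setProperty("tcpKeepAlive", "true");

        // A loopback listener stands in for a peer's leader election port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket sock = new Socket(server.getInetAddress(), server.getLocalPort())) {
            setSockOpts(sock, readTcpKeepAlive(cfg));
            System.out.println("SO_KEEPALIVE enabled: " + sock.getKeepAlive());
        }
    }
}
```

Note that setKeepAlive(true) only turns on the SO_KEEPALIVE socket option; how
often probes are sent is decided by the operating system (on Linux the default
idle time before the first probe is 7200 seconds), so the OS settings may need
to be lower than the idle timeout of the network element for this to help.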
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)