[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113792#comment-16113792 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun closed the pull request at:

    https://github.com/apache/zookeeper/pull/312

> Operations to server will be timed-out while thousands of sessions expired same time
>
>                 Key: ZOOKEEPER-1669
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1669
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.3.5
>            Reporter: tokoot
>            Assignee: Cheney Sun
>              Labels: performance
>             Fix For: 3.4.11
>
> If there are thousands of clients, and most of them disconnect from the server
> at the same time (clients restarted, or servers partitioned from clients), the
> server becomes busy closing those "connections" and becomes unavailable. The
> problem is in the following code:
>   private void closeSessionWithoutWakeup(long sessionId) {
>       HashSet cnxns;
>       synchronized (this.cnxns) {
>           cnxns = (HashSet)this.cnxns.clone();  // other threads block here
>       }
>       ...
>   }
> A real-world example that demonstrated this problem (kudos to [~sun.cheney]):
> {noformat}
> The issue arises when tens of thousands of clients try to reconnect to the
> ZooKeeper service.
> We came across the issue while maintaining our HBase cluster, which used a
> 5-server ZooKeeper cluster.
> The HBase cluster was composed of many regionservers (on the order of
> thousands) and was accessed by tens of thousands of clients doing massive
> reads/writes.
> Because the r/w throughput is very high, the ZooKeeper zxid increased quickly
> as well.
> Roughly every two or three weeks, ZooKeeper would hold a leader re-election
> triggered by the zxid rollover.
> The leader re-election causes the clients (HBase regionservers and HBase
> clients) to disconnect from and reconnect to the ZooKeeper servers in the
> meantime, and to try to renew their sessions.
> In the current implementation of session renewal, NIOServerCnxnFactory first
> clones all the connections in order to avoid race conditions between threads,
> then iterates over the cloned connection set one by one to find the session
> to renew. This is very time-consuming. In our case (described above), many
> region servers could not renew their sessions before the session timeout, and
> eventually the HBase cluster lost these region servers, which affected HBase
> stability.
> The change refactors the close-session logic and introduces a
> ConcurrentHashMap that maps session ids to connections. It is a thread-safe
> data structure and eliminates the need to clone the connection set first.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
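The fix described in the issue above (indexing connections by session id instead of cloning and scanning the whole connection set under a lock) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed patch: `Cnxn` and `SessionRegistry` are hypothetical names standing in for the real server classes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a server connection; not ZooKeeper's NIOServerCnxn.
final class Cnxn {
    final long sessionId;
    private volatile boolean closed;

    Cnxn(long sessionId) {
        this.sessionId = sessionId;
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Hypothetical registry illustrating a session-id -> connection index.
final class SessionRegistry {
    private final ConcurrentMap<Long, Cnxn> bySession = new ConcurrentHashMap<>();

    void register(Cnxn cnxn) {
        bySession.put(cnxn.sessionId, cnxn);
    }

    // Removes and closes the connection for sessionId with a single map
    // operation, without locking or copying the full connection set.
    // Returns false if no connection was registered for that session.
    boolean closeSession(long sessionId) {
        Cnxn cnxn = bySession.remove(sessionId);
        if (cnxn == null) {
            return false;
        }
        cnxn.close();
        return true;
    }

    int size() {
        return bySession.size();
    }
}
```

With a map like this, closing one session no longer costs a copy of every connection, so a burst of thousands of expirations does not multiply into thousands of full-set clones.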
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113791#comment-16113791 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @hanm @eribeiro Thanks for your help.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113030#comment-16113030 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    Committed to 3.4: https://github.com/apache/zookeeper/commit/7294f8b1b260c76fc6cdd5d3f6e5125c4e9577b3.
    Thanks for your contribution, @CheneySun. Please close the pull request.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112715#comment-16112715 ]

Hadoop QA commented on ZOOKEEPER-1669:
--------------------------------------

+1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +0 tests included.  The patch appears to be a documentation patch that doesn't require tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112686#comment-16112686 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r131136171

    --- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
    @@ -1001,25 +1010,14 @@ public String toString() {
         @Override
         public void close() {
             synchronized(factory.cnxns){
    --- End diff --

    @hanm the synchronization is indeed excessive. already removed.
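One way an outer `synchronized(factory.cnxns)` block around `close()` can become unnecessary is when the connection set is itself thread-safe and `close()` is idempotent. The sketch below illustrates that general idea only; it is an assumption about the approach, not the change that was committed, and `CnxnTracker` and `TrackedCnxn` are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical factory-side bookkeeping backed by a concurrent set,
// so callers do not need to hold an external lock to add or remove.
final class CnxnTracker {
    private final Set<TrackedCnxn> cnxns = ConcurrentHashMap.newKeySet();

    void add(TrackedCnxn cnxn) {
        cnxns.add(cnxn);
    }

    boolean remove(TrackedCnxn cnxn) {
        return cnxns.remove(cnxn);
    }

    int count() {
        return cnxns.size();
    }
}

// Hypothetical connection whose close() is safe to call from any thread,
// any number of times, without synchronizing on the tracker's set.
final class TrackedCnxn {
    private final CnxnTracker tracker;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    private TrackedCnxn(CnxnTracker tracker) {
        this.tracker = tracker;
    }

    // Creates a connection and registers it with the tracker.
    static TrackedCnxn open(CnxnTracker tracker) {
        TrackedCnxn cnxn = new TrackedCnxn(tracker);
        tracker.add(cnxn);
        return cnxn;
    }

    void close() {
        // Only the first caller proceeds; later calls are no-ops.
        if (!closed.compareAndSet(false, true)) {
            return;
        }
        tracker.remove(this);
    }

    boolean isClosed() {
        return closed.get();
    }
}
```

The compare-and-set guard replaces the "is it still in the set?" check that the lock would otherwise have to protect.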
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112142#comment-16112142 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r131049051

    --- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
    @@ -1001,25 +1010,14 @@ public String toString() {
         @Override
         public void close() {
             synchronized(factory.cnxns){
    --- End diff --

    @CheneySun Please let me know what you think regarding my comment about removing the excessive synchronization here.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106730#comment-16106730 ]

Hadoop QA commented on ZOOKEEPER-1669:
--------------------------------------

+1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +0 tests included.  The patch appears to be a documentation patch that doesn't require tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/910//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/910//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/910//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106717#comment-16106717 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r130259699

    --- Diff: src/java/main/org/apache/zookeeper/server/ServerCnxn.java ---
    @@ -101,6 +102,13 @@ public boolean removeAuthInfo(Id id) {

         abstract void setSessionTimeout(int sessionTimeout);

    +    /**
    +     * Wrapper method to return the socket address
    +     */
    +    public InetAddress getSocketAddress() {
    --- End diff --

    fixed. Thanks @eribeiro .
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106546#comment-16106546 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    3.5 and master already have sessionMap, so the issue this PR attempts to fix should not be a problem there. The problem in 3.5 and master is that NettyServerCnx and NIOServerCnx have a slight mismatch, which should be fixed in a separate JIRA. So let's scope this PR for 3.4 only.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106542#comment-16106542 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user eribeiro commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @CheneySun I see this patch doesn't apply to `branch-3.5/master` so make sure you open another PR to address it on those branches. 👍
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106538#comment-16106538 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r130229571

    --- Diff: src/java/main/org/apache/zookeeper/server/ServerCnxn.java ---
    @@ -101,6 +102,13 @@ public boolean removeAuthInfo(Id id) {

         abstract void setSessionTimeout(int sessionTimeout);

    +    /**
    +     * Wrapper method to return the socket address
    +     */
    +    public InetAddress getSocketAddress() {
    --- End diff --

    `public abstract InetAddress getSocketAddress();`
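The review suggestion above is to declare the accessor abstract instead of giving the base class a wrapper body. A minimal sketch of that shape follows; `ConnectionSketch` and `LoopbackConnectionSketch` are hypothetical names used only for illustration, not the real `ServerCnxn` hierarchy.

```java
import java.net.InetAddress;

// Hypothetical base class: it only declares the contract.
abstract class ConnectionSketch {
    // Each transport (NIO, Netty, ...) would supply its own implementation.
    public abstract InetAddress getSocketAddress();

    // Shared code can still rely on the accessor without knowing the transport.
    String describe() {
        InetAddress address = getSocketAddress();
        return address == null ? "<unknown>" : address.getHostAddress();
    }
}

// Hypothetical concrete transport that always reports the loopback address.
final class LoopbackConnectionSketch extends ConnectionSketch {
    @Override
    public InetAddress getSocketAddress() {
        return InetAddress.getLoopbackAddress();
    }

    public static void main(String[] args) {
        System.out.println(new LoopbackConnectionSketch().describe());
    }
}
```

Making the method abstract means a subclass that forgets to implement it fails at compile time, rather than inheriting a default that may not fit its transport.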
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106335#comment-16106335 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r130233860

    --- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
    @@ -1001,25 +1010,14 @@ public String toString() {
         @Override
         public void close() {
             synchronized(factory.cnxns){
    --- End diff --

    The removeCnxn already synchronizes on the cnxns so this synchronization can be removed.
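The point of this review comment can be illustrated with a small sketch. Java monitors are reentrant, so wrapping a call in `synchronized` on a lock that the callee acquires anyway is legal but redundant. The class and method names below are hypothetical and only mirror the shape of the code under review.

```java
import java.util.HashSet;
import java.util.Set;

final class RedundantLockSketch {
    final Set<Object> cnxns = new HashSet<>();

    void addCnxn(Object cnxn) {
        synchronized (cnxns) {
            cnxns.add(cnxn);
        }
    }

    // The helper already guards the shared set itself.
    boolean removeCnxn(Object cnxn) {
        synchronized (cnxns) {
            return cnxns.remove(cnxn);
        }
    }

    // Before: the outer block re-acquires the monitor that removeCnxn takes anyway.
    boolean closeWithOuterLock(Object cnxn) {
        synchronized (cnxns) {
            return removeCnxn(cnxn);
        }
    }

    // After: same result, one lock acquisition.
    boolean closeWithoutOuterLock(Object cnxn) {
        return removeCnxn(cnxn);
    }

    public static void main(String[] args) {
        RedundantLockSketch s = new RedundantLockSketch();
        Object c = new Object();
        s.addCnxn(c);
        System.out.println(s.closeWithoutOuterLock(c));
    }
}
```

Dropping the outer block is only safe when nothing else inside it needs to be atomic with the removal; in this sketch the removal is the only guarded step.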
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106336#comment-16106336 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r130233880

    --- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
    @@ -1001,25 +1010,14 @@ public String toString() {
         @Override
         public void close() {
             synchronized(factory.cnxns){
    --- End diff --

    Other than this the patch looks good.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104837#comment-16104837 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @eribeiro @hanm Can you review this PR again? Thanks.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097650#comment-16097650 ]

Hadoop QA commented on ZOOKEEPER-1669:
--------------------------------------

+1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +0 tests included.  The patch appears to be a documentation patch that doesn't require tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097645#comment-16097645 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @eribeiro @hanm Thanks for your kind suggestions.
    In branch-3.5, I found the issue was already fixed in [ZOOKEEPER-1504], which is linked with [ZOOKEEPER-1347]. So please disregard the ticket I cited above; I really meant to cite [ZOOKEEPER-1504]. Sorry about the confusion.

    The changes have now also been made to NettyServerCnxn(Factory), and the PR description has been updated as well. Can you continue reviewing the changes?
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096415#comment-16096415 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @CheneySun Good summary! I've posted those on the JIRA description. Please also update the description of this pull request with the same (I can't modify your pull request:).
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096271#comment-16096271 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user eribeiro commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @CheneySun mate, the description you wrote to @hanm should be in **both** the JIRA description and this PR comment, not in review comments. It helps set up context, motivation, etc. Keep this in mind next time, but please add these missing pieces accordingly. ;)

    Also, you cited [ZOOKEEPER-1347] but, as Michael wrote, it seems to be an unrelated ticket. :thinking: Could you elaborate on that?
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096243#comment-16096243 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user eribeiro commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    You can set up Netty by setting the system property:

    `zookeeper.serverCnxnFactory="org.apache.zookeeper.server.NettyServerCnxnFactory"`

    Take a look at some test cases.
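As a rough illustration of the comment above, the property can be set programmatically, for example in a test's setup, before the server creates its connection factory. The sketch below only sets and reads back the property named in the comment; it does not start a ZooKeeper server, and the class name `NettyFactoryPropertySketch` is hypothetical.

```java
final class NettyFactoryPropertySketch {
    // Property name and value exactly as quoted in the review comment.
    static final String FACTORY_PROPERTY = "zookeeper.serverCnxnFactory";
    static final String NETTY_FACTORY =
            "org.apache.zookeeper.server.NettyServerCnxnFactory";

    // Sets the property and returns the value the JVM now reports for it.
    static String selectNetty() {
        System.setProperty(FACTORY_PROPERTY, NETTY_FACTORY);
        return System.getProperty(FACTORY_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(selectNetty());
    }
}
```

The same effect can be had from the command line with the standard JVM option `-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory`. In either case the property has to be in place before the server picks its factory.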
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096094#comment-16096094 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/312#discussion_r128729244

    --- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxnFactory.java ---
    @@ -275,20 +307,9 @@ public synchronized void closeSession(long sessionId) {

         @SuppressWarnings("unchecked")
         private void closeSessionWithoutWakeup(long sessionId) {
    -        HashSet<NIOServerCnxn> cnxns;
    -        synchronized (this.cnxns) {
    -            cnxns = (HashSet<NIOServerCnxn>)this.cnxns.clone();
    -        }
    -
    -        for (NIOServerCnxn cnxn : cnxns) {
    -            if (cnxn.getSessionId() == sessionId) {
    -                try {
    -                    cnxn.close();
    -                } catch (Exception e) {
    -                    LOG.warn("exception during session close", e);
    -                }
    -                break;
    -            }
    +        NIOServerCnxn cnxn = sessionMap.remove(sessionId);
    +        if (cnxn != null) {
    +            cnxn.close();
    --- End diff --

    @eribeiro good catch, I will fix it.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095982#comment-16095982 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @eribeiro Thanks for your review, I will replicate the changes to NettyServerCnxn.
    BTW, how do I make use of NettyServerCnxn as the underlying transport? NIOServerCnxn is the default transport implementation, and I didn't find the knob to switch to NettyServerCnxn.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095977#comment-16095977 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user CheneySun commented on the issue:

    https://github.com/apache/zookeeper/pull/312

    @hanm The issue arises when tens of thousands of clients try to reconnect to the ZooKeeper service.
    Actually, we came across the issue while maintaining our HBase cluster, which used a 5-server ZooKeeper cluster. The HBase cluster was composed of many regionservers (on the order of thousands), and was connected to by tens of thousands of clients doing massive reads/writes. Because the r/w throughput is very high, the ZooKeeper zxid increased quickly as well. Basically, every two or three weeks, ZooKeeper would trigger a leader re-election because of the zxid rollover. The leader re-election causes the clients (HBase regionservers and HBase clients) to disconnect from and reconnect to the ZooKeeper servers at the same time, and to try to renew their sessions.

    In the current implementation of session renewal, NIOServerCnxnFactory first clones all the connections in order to avoid race conditions between threads, and then iterates over the cloned connection set one by one to find the related session to renew. This is very time consuming. In our case (described above), it caused many region servers to fail to renew their sessions before the session timeout, and eventually the HBase cluster lost these region servers, which affected HBase stability.

    The change refactors the close-session logic and introduces a ConcurrentHashMap to store the session id to connection mapping. It is a thread-safe data structure and eliminates the need to clone the connection set first.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095820#comment-16095820 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:

Github user eribeiro commented on the issue: https://github.com/apache/zookeeper/pull/312

@CheneySun Don't forget to replicate these changes in `NettyServerCnxn` and its factory. It's important to keep them in sync as much as possible, even more so if you are adding a new data structure to speed up this part of the code: https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/NettyServerCnxnFactory.java#L414-L423
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095814#comment-16095814 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:

Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/312#discussion_r128688215

--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxnFactory.java ---
@@ -62,6 +63,10 @@
      */
     final ByteBuffer directBuffer = ByteBuffer.allocateDirect(64 * 1024);
 
+    // sessionMap is used to accelerate closeSession()
+    private final ConcurrentHashMap sessionMap =
--- End diff --

`private final ConcurrentMap sessionMap = `
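The review suggestion above is to declare the field by its interface type (`ConcurrentMap`) and keep `ConcurrentHashMap` only on the right-hand side. A minimal sketch of that pattern follows; the class name `SessionIndexDemo`, the `Long` key and the `String` value are assumptions for illustration, because the type parameters are not visible in the archived diff.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical holder showing a field typed by interface rather than by implementation. */
final class SessionIndexDemo {
    // Callers only depend on the ConcurrentMap contract; the implementation can change later.
    private final ConcurrentMap<Long, String> sessionMap = new ConcurrentHashMap<>();

    /** Returns true if the session id was not registered before. */
    boolean add(long sessionId, String cnxnName) {
        return sessionMap.putIfAbsent(sessionId, cnxnName) == null;
    }

    /** Returns the removed value, or null if the session id was unknown. */
    String remove(long sessionId) {
        return sessionMap.remove(sessionId);
    }

    int size() {
        return sessionMap.size();
    }
}
```

Atomic operations such as `putIfAbsent` and `remove` are part of the `ConcurrentMap` interface, so nothing in this usage needs the concrete class.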
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095812#comment-16095812 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:

Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/312#discussion_r128688135

--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxnFactory.java ---
@@ -275,20 +307,9 @@ public synchronized void closeSession(long sessionId) {
     @SuppressWarnings("unchecked")
     private void closeSessionWithoutWakeup(long sessionId) {
-        HashSet cnxns;
-        synchronized (this.cnxns) {
-            cnxns = (HashSet)this.cnxns.clone();
-        }
-
-        for (NIOServerCnxn cnxn : cnxns) {
-            if (cnxn.getSessionId() == sessionId) {
-                try {
-                    cnxn.close();
-                } catch (Exception e) {
-                    LOG.warn("exception during session close", e);
-                }
-                break;
-            }
+        NIOServerCnxn cnxn = sessionMap.remove(sessionId);
+        if (cnxn != null) {
+            cnxn.close();
--- End diff --

Why did you remove the `try-catch` block around `cnxn.close()`? We can still have exceptions thrown during `cnxn.close()`, right?
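One way to address the review question above is to keep the map lookup but restore the guard around the close call. The following is a sketch under assumptions, not the patch itself: `ClosableCnxn` and `GuardedSessionCloser` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained (the diff shows a `LOG.warn` call from a different logging API).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical connection abstraction whose close() may fail. */
interface ClosableCnxn {
    void close() throws Exception;
}

/** Hypothetical closer: O(1) lookup by session id, with the exception guard kept. */
final class GuardedSessionCloser {
    private static final Logger LOG = Logger.getLogger(GuardedSessionCloser.class.getName());

    private final ConcurrentMap<Long, ClosableCnxn> sessionMap = new ConcurrentHashMap<>();

    void register(long sessionId, ClosableCnxn cnxn) {
        sessionMap.put(sessionId, cnxn);
    }

    /** Returns true if a connection was registered for the session, even if close() threw. */
    boolean closeSession(long sessionId) {
        ClosableCnxn cnxn = sessionMap.remove(sessionId);
        if (cnxn == null) {
            return false;
        }
        try {
            cnxn.close();
        } catch (Exception e) {
            // Same intent as the removed block: log and carry on, so one bad close
            // does not abort the caller.
            LOG.log(Level.WARNING, "exception during session close", e);
        }
        return true;
    }

    int size() {
        return sessionMap.size();
    }
}
```

The entry is removed from the map before `close()` is attempted, so a failing close still leaves the index consistent.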
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095800#comment-16095800 ]

Hadoop QA commented on ZOOKEEPER-1669:

+1 overall. GitHub Pull Request Build

    +1 @author. The patch does not contain any @author tags.
    +0 tests included. The patch appears to be a documentation patch that doesn't require tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/894//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/894//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/894//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094168#comment-16094168 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:

Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/312

@CheneySun some quick comments:

* Can you please add more description to the pull request regarding how this patch fixes the issue? You mentioned "just porting the work in [ZOOKEEPER-1347]", but I don't see that ZOOKEEPER-1347 has a patch, nor that it's committed to master.
* There are some format-only changes, such as indentation changes. We prefer not to mix format changes with functional changes in a patch because it makes reviewing harder. But in this case I think it's fine, because the old code was not well formatted and the format-only changes are not too big to review.

I'll take another pass on your patch later this week.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092905#comment-16092905 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:

Github user CheneySun commented on the issue: https://github.com/apache/zookeeper/pull/312

@hanm, could you review this PR?
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091520#comment-16091520 ]

Hadoop QA commented on ZOOKEEPER-1669:

+1 overall. GitHub Pull Request Build

    +1 @author. The patch does not contain any @author tags.
    +0 tests included. The patch appears to be a documentation patch that doesn't require tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/885//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/885//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/885//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091472#comment-16091472 ]

ASF GitHub Bot commented on ZOOKEEPER-1669:

GitHub user CheneySun opened a pull request:

    https://github.com/apache/zookeeper/pull/312

    ZOOKEEPER-1669: Operations to server will be timed-out while thousands of sessions expired same time

    just porting the work in [ZOOKEEPER-1347] to branch 3.4

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CheneySun/zookeeper branch-3.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/312.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #312

commit 59d71077640643f13f036dd67741ef944b48255b
Author: Cheney Sun
Date: 2017-07-18T12:14:01Z

    ZOOKEEPER-1669: Operations to server will be timed-out while thousands of sessions expired same time
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091403#comment-16091403 ]

Cheney Sun commented on ZOOKEEPER-1669:

It looks like the issue was already addressed in [ZOOKEEPER-1347] and a patch was available, but it was not fixed in 3.4, which is the latest stable version and the one we currently use. @Michael, would it make sense to port the fix to an older version, 3.4 or 3.3?
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085223#comment-16085223 ]

Cheney Sun commented on ZOOKEEPER-1669:

@Michael, yes, I would like to fix it. Actually, I have made an initial change, which gave a great performance improvement. I will submit the official patch in the coming days.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085182#comment-16085182 ]

Michael Han commented on ZOOKEEPER-1669:

I don't think there are any plans for this issue, or that anyone is actively working on it. It seems to be a good performance improvement that is worth doing, though. [~sun.cheney] If you have a fix, are you willing to submit a patch? I can help review it. Also, if you can share your use case here, that will greatly benefit the community.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083400#comment-16083400 ]

Cheney Sun commented on ZOOKEEPER-1669:

Is there any plan to fix this issue? We have run into it several times in the past weeks. It looks like the latest version doesn't address it.
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608802#comment-13608802 ]

Jacky007 commented on ZOOKEEPER-1669:

I think it is. In one of our environments, there are tens of thousands of connections and 300~500 session closes per second (these clients create a connection for a read and close it immediately). The code you described significantly affects performance.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
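A back-of-the-envelope estimate shows why the numbers in this comment hurt. If every session close copies the full connection set, the copy work per second is roughly the number of open connections multiplied by the number of closes per second. The sketch below uses 20,000 connections as an assumed value for "tens of thousands"; the class and method names are hypothetical, and the model ignores the scan that follows the copy.

```java
/** Hypothetical estimator for the clone-per-close cost; a rough model, not a benchmark. */
final class CloneCostEstimate {
    /** Element copies per second when each close clones the whole connection set. */
    static long elementCopiesPerSecond(long openConnections, long closesPerSecond) {
        return openConnections * closesPerSecond;
    }

    public static void main(String[] args) {
        long connections = 20_000L; // assumed value for "tens of thousands"
        System.out.println(elementCopiesPerSecond(connections, 300L)); // prints 6000000
        System.out.println(elementCopiesPerSecond(connections, 500L)); // prints 10000000
    }
}
```

Under these assumptions the server copies 6 to 10 million set entries per second while holding the lock on the connection set, which is work that a keyed lookup avoids entirely.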
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604613#comment-13604613 ]

tokoot commented on ZOOKEEPER-1669:

Thanks for your advice, Jacky007. We have solved it the same way. I see the problem still exists in the latest version; should we fix it in the next one?
[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603262#comment-13603262 ]

Jacky007 commented on ZOOKEEPER-1669:

We have paid for this. But the fix is simple: you can put it in a hash when the session is created, and look it up from the hash when closing it. :)