[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113792#comment-16113792
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun closed the pull request at:

https://github.com/apache/zookeeper/pull/312


> Operations to server will be timed-out while thousands of sessions expired 
> same time
> 
>
> Key: ZOOKEEPER-1669
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1669
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 3.3.5
>Reporter: tokoot
>Assignee: Cheney Sun
>  Labels: performance
> Fix For: 3.4.11
>
>
> If there are thousands of clients and most of them disconnect from the 
> server at the same time (clients restarted, or servers partitioned from 
> clients), the server is kept busy closing those "connections" and becomes 
> unavailable. The problem is in the following code:
>   private void closeSessionWithoutWakeup(long sessionId) {
>       HashSet cnxns;
>       synchronized (this.cnxns) {
>           // other threads will block here while the set is cloned
>           cnxns = (HashSet)this.cnxns.clone();
>       }
>       ...
>   }
> A real world example that demonstrated this problem (Kudos to [~sun.cheney]):
> {noformat}
> The issue arises when tens of thousands of clients try to reconnect to the 
> ZooKeeper service. 
> We came across it while maintaining our HBase cluster, which used a 
> 5-server ZooKeeper cluster. 
> The HBase cluster was composed of a very large number of regionservers (on 
> the order of thousands) 
> and was accessed by tens of thousands of clients doing massive reads/writes. 
> Because the r/w throughput was very high, the ZooKeeper zxid increased 
> quickly as well. 
> Roughly every two or three weeks, ZooKeeper would hold a leader re-election 
> triggered by the zxid rollover. 
> The leader re-election caused the clients (HBase regionservers and HBase 
> clients) to disconnect from 
> and reconnect to the ZooKeeper servers at the same time, and to try to renew 
> their sessions.
> In the current implementation of session renewal, NIOServerCnxnFactory first 
> clones all the connections 
> to avoid race conditions between threads, and then iterates over the cloned 
> connection set one by one to 
> find the connection for the session being renewed. This is very time 
> consuming. In our case (described above), 
> many region servers could not renew their sessions before the session 
> timeout, 
> so the HBase cluster eventually lost these region servers, which affected 
> HBase stability.
> The change refactors the close-session logic and introduces a 
> ConcurrentHashMap 
> that maps session ids to connections. It is a thread-safe data 
> structure 
> and eliminates the need to clone the connection set first.
> {noformat}
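
For readers following the description above, here is a minimal sketch of the two approaches it contrasts: cloning the whole connection set under a lock and scanning it, versus a direct lookup in a ConcurrentHashMap keyed by session id. This is not the committed patch. The class SessionCloseSketch, the Cnxn stand-in, and the method names closeByScan and closeByLookup are all hypothetical and exist only for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionCloseSketch {

    /** Hypothetical stand-in for a server connection (not ZooKeeper's ServerCnxn). */
    static final class Cnxn {
        final long sessionId;
        volatile boolean closed;

        Cnxn(long sessionId) {
            this.sessionId = sessionId;
        }

        void close() {
            closed = true;
        }
    }

    // Approach described as the problem: one shared set guarded by its monitor.
    static final HashSet<Cnxn> cnxns = new HashSet<>();

    /** Copies the whole set under the lock, then scans the copy linearly. */
    static boolean closeByScan(long sessionId) {
        HashSet<Cnxn> copy;
        synchronized (cnxns) {
            // O(n) copy while holding the lock; every other thread waits here.
            copy = new HashSet<>(cnxns);
        }
        for (Cnxn c : copy) {
            if (c.sessionId == sessionId) {
                synchronized (cnxns) {
                    cnxns.remove(c);
                }
                c.close();
                return true;
            }
        }
        return false;
    }

    // Approach described as the fix: a thread-safe map from session id to connection.
    static final Map<Long, Cnxn> sessionMap = new ConcurrentHashMap<>();

    /** Removes the connection for one session directly; no clone, no scan. */
    static boolean closeByLookup(long sessionId) {
        Cnxn c = sessionMap.remove(sessionId);
        if (c == null) {
            return false; // unknown session, or already closed by another thread
        }
        c.close();
        return true;
    }

    public static void main(String[] args) {
        for (long id = 1; id <= 3; id++) {
            Cnxn c = new Cnxn(id);
            synchronized (cnxns) {
                cnxns.add(c);
            }
            sessionMap.put(id, c);
        }
        System.out.println("scan close of session 2: " + closeByScan(2L));
        System.out.println("lookup close of session 3: " + closeByLookup(3L));
        System.out.println("repeat lookup close of session 3: " + closeByLookup(3L));
    }
}
```

With n open connections, each close in the first variant costs a full copy plus a scan, so a burst of n closes does work proportional to n squared and serializes on one lock. The map variant does a single keyed removal per close, which is why the description expects it to remove the need for the clone.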



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113791#comment-16113791
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@hanm @eribeiro Thanks for your help. 




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113030#comment-16113030
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
Committed to 3.4: 
https://github.com/apache/zookeeper/commit/7294f8b1b260c76fc6cdd5d3f6e5125c4e9577b3.

Thanks for your contribution, @CheneySun. Please close the pull request.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-08-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112715#comment-16112715
 ] 

Hadoop QA commented on ZOOKEEPER-1669:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112686#comment-16112686
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r131136171
  
--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
@@ -1001,25 +1010,14 @@ public String toString() {
 @Override
 public void close() {
 synchronized(factory.cnxns){
--- End diff --

@hanm The synchronization is indeed excessive; I have already removed it.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112142#comment-16112142
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r131049051
  
--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
@@ -1001,25 +1010,14 @@ public String toString() {
 @Override
 public void close() {
 synchronized(factory.cnxns){
--- End diff --

@CheneySun Please let me know what you think regarding my comment about 
removing the excessive synchronization here. 




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106730#comment-16106730
 ] 

Hadoop QA commented on ZOOKEEPER-1669:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/910//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/910//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/910//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106717#comment-16106717
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r130259699
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerCnxn.java ---
@@ -101,6 +102,13 @@ public boolean removeAuthInfo(Id id) {
 
 abstract void setSessionTimeout(int sessionTimeout);
 
+/**
+ * Wrapper method to return the socket address
+ */
+public InetAddress getSocketAddress() {
--- End diff --

Fixed. Thanks, @eribeiro.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106546#comment-16106546
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
3.5 and master already have sessionMap, so the issue this PR attempts to fix 
should not be a problem there. 
The problem in 3.5 and master is that NettyServerCnxn and NIOServerCnxn have a 
slight mismatch, which should be fixed in a separate JIRA. 
So let's scope this PR to 3.4 only.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106542#comment-16106542
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@CheneySun I see this patch doesn't apply to `branch-3.5/master` so make 
sure you open another PR to address it on those branches. 👍 




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106538#comment-16106538
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r130229571
  
--- Diff: src/java/main/org/apache/zookeeper/server/ServerCnxn.java ---
@@ -101,6 +102,13 @@ public boolean removeAuthInfo(Id id) {
 
     abstract void setSessionTimeout(int sessionTimeout);
 
+    /**
+     * Wrapper method to return the socket address
+     */
+    public InetAddress getSocketAddress() {
--- End diff --

`public abstract InetAddress getSocketAddress();`
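
The suggestion above amounts to declaring the accessor abstract in the base class, so that each transport has to supply its own implementation. A minimal sketch of that shape, assuming hypothetical stand-in classes (`BaseCnxn`, `SocketBackedCnxn`) rather than the real `ServerCnxn` hierarchy:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical stand-in for ServerCnxn; not the real ZooKeeper class.
abstract class BaseCnxn {
    // Abstract: every concrete transport must say where its peer address comes from.
    public abstract InetAddress getSocketAddress();
}

// Hypothetical concrete connection that remembers its remote endpoint.
class SocketBackedCnxn extends BaseCnxn {
    private final InetSocketAddress remote;

    SocketBackedCnxn(InetSocketAddress remote) {
        this.remote = remote;
    }

    @Override
    public InetAddress getSocketAddress() {
        // getAddress() returns null when the InetSocketAddress is unresolved.
        return remote.getAddress();
    }
}

class AbstractAccessorSketch {
    public static void main(String[] args) {
        BaseCnxn cnxn = new SocketBackedCnxn(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 2181));
        System.out.println(cnxn.getSocketAddress());
    }
}
```

With an abstract declaration the compiler rejects any subclass that forgets the accessor, which a default body in the base class would not do.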




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106335#comment-16106335
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r130233860
  
--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
@@ -1001,25 +1010,14 @@ public String toString() {
     @Override
     public void close() {
         synchronized(factory.cnxns){
--- End diff --

`removeCnxn` already synchronizes on `cnxns`, so this synchronization 
can be removed. 
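
The point about the redundant lock can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `FactorySketch` and `RedundantLockSketch` are hypothetical stand-ins, and dropping the outer block is only safe if nothing else in `close()` needs to be atomic with the removal.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the connection factory; not the real ZooKeeper class.
class FactorySketch {
    final Set<Object> cnxns = new HashSet<>();

    void addCnxn(Object cnxn) {
        synchronized (cnxns) {
            cnxns.add(cnxn);
        }
    }

    // Already guards the set, so callers need no lock of their own.
    boolean removeCnxn(Object cnxn) {
        synchronized (cnxns) {
            return cnxns.remove(cnxn);
        }
    }
}

class RedundantLockSketch {
    // Before: the outer block re-acquires a monitor that removeCnxn takes anyway.
    // Java monitors are reentrant, so this works, but the outer lock adds nothing.
    static boolean closeWithOuterLock(FactorySketch factory, Object cnxn) {
        synchronized (factory.cnxns) {
            return factory.removeCnxn(cnxn);
        }
    }

    // After: same result with a single lock acquisition.
    static boolean closeWithoutOuterLock(FactorySketch factory, Object cnxn) {
        return factory.removeCnxn(cnxn);
    }

    public static void main(String[] args) {
        FactorySketch factory = new FactorySketch();
        Object first = new Object();
        Object second = new Object();
        factory.addCnxn(first);
        factory.addCnxn(second);
        System.out.println(closeWithOuterLock(factory, first));
        System.out.println(closeWithoutOuterLock(factory, second));
    }
}
```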




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106336#comment-16106336
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r130233880
  
--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java ---
@@ -1001,25 +1010,14 @@ public String toString() {
     @Override
     public void close() {
         synchronized(factory.cnxns){
--- End diff --

Other than this the patch looks good. 




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104837#comment-16104837
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@eribeiro @hanm Can you review this PR again? Thanks.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097650#comment-16097650
 ] 

Hadoop QA commented on ZOOKEEPER-1669:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097645#comment-16097645
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@eribeiro @hanm Thanks for your kind suggestions. In branch-3.5, I found 
that the issue was already fixed by [ZOOKEEPER-1504], which is linked to 
[ZOOKEEPER-1347]. Please disregard the earlier citation; I meant to cite 
[ZOOKEEPER-1504]. Sorry about the confusion.

The changes have now also been made to NettyServerCnxn(Factory), and the PR 
description has been updated. Could you continue reviewing the changes?





[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096415#comment-16096415
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@CheneySun Good summary! I've posted those on the JIRA description. Please 
also update the description of this pull request with the same (I can't modify 
your pull request:). 




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096271#comment-16096271
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@cheneysun mate, the description you wrote to @hanm should be on **both** 
the JIRA description and this PR comment, not in review comments. It helps 
set up context, motivation, etc. Keep this in mind next time, and please add 
the missing pieces accordingly. ;)

Also, you cited [ZOOKEEPER-1347] but, as Michael wrote, it seems to be an 
unrelated ticket. :thinking: Could you elaborate on that?




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096243#comment-16096243
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
You can set up Netty by setting the system property:

`zookeeper.serverCnxnFactory="org.apache.zookeeper.server.NettyServerCnxnFactory"`

Take a look at some test cases.
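
As a rough illustration of the setting above: the property can be supplied on the JVM command line with `-D`, or set programmatically before the server creates its connection factory. The sketch below only sets and reads back the property; the class `NettyFactoryPropertySketch` and its helper are hypothetical, and no ZooKeeper classes are loaded.

```java
class NettyFactoryPropertySketch {
    // Property name and factory class name as quoted in the comment above.
    static final String FACTORY_PROPERTY = "zookeeper.serverCnxnFactory";
    static final String NETTY_FACTORY =
            "org.apache.zookeeper.server.NettyServerCnxnFactory";

    // Hypothetical helper: selects the Netty factory and returns the effective value.
    static String selectNetty() {
        System.setProperty(FACTORY_PROPERTY, NETTY_FACTORY);
        return System.getProperty(FACTORY_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(FACTORY_PROPERTY + "=" + selectNetty());
    }
}
```

In a test, the property would need to be set before the server under test starts, and cleared afterwards so other tests fall back to the default NIO transport.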




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096094#comment-16096094
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r128729244
  
--- Diff: src/java/main/org/apache/zookeeper/server/NIOServerCnxnFactory.java ---
@@ -275,20 +307,9 @@ public synchronized void closeSession(long sessionId) {
 
     @SuppressWarnings("unchecked")
     private void closeSessionWithoutWakeup(long sessionId) {
-        HashSet<NIOServerCnxn> cnxns;
-        synchronized (this.cnxns) {
-            cnxns = (HashSet<NIOServerCnxn>)this.cnxns.clone();
-        }
-
-        for (NIOServerCnxn cnxn : cnxns) {
-            if (cnxn.getSessionId() == sessionId) {
-                try {
-                    cnxn.close();
-                } catch (Exception e) {
-                    LOG.warn("exception during session close", e);
-                }
-                break;
-            }
+        NIOServerCnxn cnxn = sessionMap.remove(sessionId);
+        if (cnxn != null) {
+            cnxn.close();
--- End diff --

@eribeiro good catch, I will fix it.
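
The hunk above swaps a clone-and-scan over every connection for a single keyed lookup. A minimal sketch of that idea, assuming hypothetical stand-ins (`CnxnSketch`, `SessionMapSketch`) in place of the real `NIOServerCnxn` and its factory. The exception guard mirrors the removed lines and is an assumption about the final shape, not something this thread confirms:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a server connection; not the real NIOServerCnxn.
class CnxnSketch {
    final long sessionId;
    volatile boolean closed;

    CnxnSketch(long sessionId) {
        this.sessionId = sessionId;
    }

    void close() {
        closed = true;
    }
}

// Hypothetical stand-in for the factory's session bookkeeping.
class SessionMapSketch {
    private final ConcurrentMap<Long, CnxnSketch> sessionMap = new ConcurrentHashMap<>();

    void register(CnxnSketch cnxn) {
        sessionMap.put(cnxn.sessionId, cnxn);
    }

    // One map operation replaces "lock, clone the whole set, scan for the id".
    // remove() is atomic, so at most one caller gets the connection to close.
    boolean closeSession(long sessionId) {
        CnxnSketch cnxn = sessionMap.remove(sessionId);
        if (cnxn == null) {
            return false; // unknown or already closed session
        }
        try {
            cnxn.close();
        } catch (RuntimeException e) {
            System.err.println("exception during session close: " + e);
        }
        return true;
    }

    int size() {
        return sessionMap.size();
    }

    public static void main(String[] args) {
        SessionMapSketch sessions = new SessionMapSketch();
        for (long id = 1; id <= 3; id++) {
            sessions.register(new CnxnSketch(id));
        }
        System.out.println(sessions.closeSession(2L));
        System.out.println(sessions.size());
    }
}
```

The cost of closing one session no longer grows with the number of open connections, which is the property the reporter needed when thousands of sessions expire together.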




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095982#comment-16095982
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@eribeiro Thanks for your review. I will replicate the changes to 
NettyServerCnxn. 

BTW, how do I use NettyServerCnxn as the underlying transport? 
NIOServerCnxn is the default transport implementation, and I didn't find the knob 
to switch to NettyServerCnxn.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095977#comment-16095977
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@hanm The issue arises when tens of thousands of clients try to reconnect to 
the ZooKeeper service. We came across it while maintaining our HBase cluster, 
which used a 5-server ZooKeeper cluster. The HBase cluster was composed of 
thousands of regionservers and was accessed by tens of thousands of clients 
doing massive reads/writes. Because the r/w throughput is very high, the 
ZooKeeper zxid increased quickly as well. Roughly every two or three weeks, 
ZooKeeper would run a leader re-election triggered by the zxid rollover. The 
leader re-election causes the clients (HBase regionservers and HBase clients) 
to disconnect from and reconnect to the ZooKeeper servers at about the same 
time, and to try to renew their sessions. 

In the current implementation of session renewal, NIOServerCnxnFactory first 
clones the whole connection set to avoid race conditions between threads, and 
then iterates over the cloned set one by one to find the connection for the 
session being renewed. This is very time consuming. In our case (described 
above), many region servers could not renew their sessions before the session 
timeout, so the HBase cluster eventually lost those region servers, which hurt 
HBase stability. 

The change refactors the close-session logic and introduces a 
ConcurrentHashMap that maps session ids to connections. It is a thread-safe 
data structure, which eliminates the need to clone the connection set first. 
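
The difference between the two approaches can be sketched as follows. This is a simplified illustration, not the code in the pull request: `CloseSessionSketch` and its nested `Cnxn` are hypothetical stand-ins for the factory and `NIOServerCnxn`, and the bookkeeping that removes a closed connection from the set is omitted.

```java
import java.util.HashSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-ins for the connection factory and NIOServerCnxn.
class CloseSessionSketch {
    static final class Cnxn {
        final long sessionId;
        volatile boolean closed;
        Cnxn(long sessionId) { this.sessionId = sessionId; }
        void close() { closed = true; }
    }

    private final HashSet<Cnxn> cnxns = new HashSet<>();
    private final ConcurrentMap<Long, Cnxn> sessionMap = new ConcurrentHashMap<>();

    void add(Cnxn c) {
        synchronized (cnxns) { cnxns.add(c); }
        sessionMap.put(c.sessionId, c);
    }

    // Old shape: copy the whole set under the lock, then scan the copy.
    // Every close costs O(n) in the number of open connections.
    @SuppressWarnings("unchecked")
    boolean closeByScan(long sessionId) {
        HashSet<Cnxn> copy;
        synchronized (cnxns) { copy = (HashSet<Cnxn>) cnxns.clone(); }
        for (Cnxn c : copy) {
            if (c.sessionId == sessionId) { c.close(); return true; }
        }
        return false;
    }

    // New shape: a single map removal, with no clone and no scan.
    boolean closeByMap(long sessionId) {
        Cnxn c = sessionMap.remove(sessionId);
        if (c == null) return false;
        c.close();
        return true;
    }
}
```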




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095820#comment-16095820
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
 @CheneySun Don't forget to replicate these changes in `NettyServerCnxn` 
and its factory. It's important to keep them in sync as much as possible, even 
more so if you are adding a new data structure to speed up this part of the code: 
https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/NettyServerCnxnFactory.java#L414-L423




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095814#comment-16095814
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r128688215
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/NIOServerCnxnFactory.java ---
@@ -62,6 +63,10 @@
 */
 final ByteBuffer directBuffer = ByteBuffer.allocateDirect(64 * 1024);
 
+// sessionMap is used to accelerate closeSession()
+private final ConcurrentHashMap sessionMap =
--- End diff --

`private final ConcurrentMap sessionMap = `




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095812#comment-16095812
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/312#discussion_r128688135
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/NIOServerCnxnFactory.java ---
@@ -275,20 +307,9 @@ public synchronized void closeSession(long sessionId) {
 
 @SuppressWarnings("unchecked")
 private void closeSessionWithoutWakeup(long sessionId) {
-HashSet cnxns;
-synchronized (this.cnxns) {
-cnxns = (HashSet)this.cnxns.clone();
-}
-
-for (NIOServerCnxn cnxn : cnxns) {
-if (cnxn.getSessionId() == sessionId) {
-try {
-cnxn.close();
-} catch (Exception e) {
-LOG.warn("exception during session close", e);
-}
-break;
-}
+NIOServerCnxn cnxn = sessionMap.remove(sessionId);
+if (cnxn != null) {
+cnxn.close();
--- End diff --

Why did you remove the `try-catch` block around `cnxn.close()`? We still 
can have exceptions being thrown during `cnxn.close()`, right?
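
One way to keep the map lookup while still guarding `close()` is sketched below. It is an illustration under assumptions, not the patch itself: the `Cnxn` interface and class name are hypothetical, `java.util.logging` stands in for the project's logger so the example is self-contained, and the method returns a boolean (the original returns `void`) only to make the behaviour easy to check.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the factory; Cnxn replaces NIOServerCnxn.
class GuardedCloseSketch {
    interface Cnxn { void close(); }

    private static final Logger LOG = Logger.getLogger("GuardedCloseSketch");
    final ConcurrentMap<Long, Cnxn> sessionMap = new ConcurrentHashMap<>();

    // Returns true when a connection was registered for the session,
    // even if closing it threw; a failing close is logged, not propagated.
    boolean closeSessionWithoutWakeup(long sessionId) {
        Cnxn cnxn = sessionMap.remove(sessionId);
        if (cnxn == null) {
            return false;
        }
        try {
            cnxn.close();
        } catch (Exception e) {
            LOG.log(Level.WARNING, "exception during session close", e);
        }
        return true;
    }
}
```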




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095800#comment-16095800
 ] 

Hadoop QA commented on ZOOKEEPER-1669:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/894//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/894//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/894//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094168#comment-16094168
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@CheneySun some quick comments:
* Can you please add more description to the pull request regarding how 
this patch fixes the issue? You mentioned "just porting the work in 
[ZOOKEEPER-1347]", but I don't see that ZOOKEEPER-1347 has a patch, nor that 
it has been committed to master.
* There are some format-only changes, such as indentation changes. We 
prefer not to mix format changes with functional changes in a patch because it 
makes reviewing harder. But in this case I think it's fine, because the old code 
was not well formatted and the format-only changes are not too big to review.

I'll take another pass on your patch later this week.




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092905#comment-16092905
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

Github user CheneySun commented on the issue:

https://github.com/apache/zookeeper/pull/312
  
@hanm could you please review this PR?




[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091520#comment-16091520
 ] 

Hadoop QA commented on ZOOKEEPER-1669:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/885//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/885//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/885//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091472#comment-16091472
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
---

GitHub user CheneySun opened a pull request:

https://github.com/apache/zookeeper/pull/312

ZOOKEEPER-1669: Operations to server will be timed-out while thousands of 
sessions expired same time

just porting the work in [ZOOKEEPER-1347] to branch 3.4

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CheneySun/zookeeper branch-3.4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/312.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #312


commit 59d71077640643f13f036dd67741ef944b48255b
Author: Cheney Sun 
Date:   2017-07-18T12:14:01Z

ZOOKEEPER-1669: Operations to server will be timed-out while thousands of 
sessions expired same time






[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-18 Thread Cheney Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091403#comment-16091403
 ] 

Cheney Sun commented on ZOOKEEPER-1669:
---

It looks like the issue is already addressed in [ZOOKEEPER-1347], where a patch 
is available, but the fix is not in 3.4, which is the latest stable version and 
the one we currently use. @Michael, would it make sense to port the fix to an 
older version, 3.4 or 3.3?



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-12 Thread Cheney Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085223#comment-16085223
 ] 

Cheney Sun commented on ZOOKEEPER-1669:
---

@Michael, yes, I would like to fix it. Actually, I have made an initial change, 
which gave a great performance improvement. I will submit the official patch 
in the coming days. 



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-12 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085182#comment-16085182
 ] 

Michael Han commented on ZOOKEEPER-1669:


I don't think there are any plans for this issue, or that anyone is actively 
working on it. It seems to be a good performance improvement that is worth 
doing, though.
[~sun.cheney] If you have a fix, are you willing to submit a patch? I can help 
review it. Also, if you can share your use case here, that will greatly benefit 
the community.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2017-07-11 Thread Cheney Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083400#comment-16083400
 ] 

Cheney Sun commented on ZOOKEEPER-1669:
---

Is there any plan to fix this issue? We have run into it several times in the 
past weeks. It looks like the latest version doesn't address it.



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2013-03-21 Thread Jacky007 (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608802#comment-13608802
 ] 

Jacky007 commented on ZOOKEEPER-1669:
-

I think it is. In one of our environments there are tens of thousands of 
connections and 300~500 session closes per second (these clients create a 
connection for a read and close it immediately). The code you described 
significantly affects performance.
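
To give a rough sense of scale: if every session close clones the full connection set, the number of elements copied per second is the close rate times the number of open connections. The figures in the sketch below (400 closes per second, 30,000 connections) are assumed midpoints chosen for illustration, not measurements from this environment, and `CloneCostEstimate` is a hypothetical name.

```java
// Back-of-the-envelope estimate; the inputs are assumptions, not measurements.
class CloneCostEstimate {
    // Elements copied per second when each close clones the whole connection set.
    static long copiesPerSecond(long closesPerSecond, long openConnections) {
        return closesPerSecond * openConnections;
    }

    public static void main(String[] args) {
        // 400 closes/s * 30,000 connections = 12,000,000 element copies per second,
        // all performed while holding the lock on the connection set.
        System.out.println(copiesPerSecond(400, 30_000));
    }
}
```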


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2013-03-17 Thread tokoot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604613#comment-13604613
 ] 

tokoot commented on ZOOKEEPER-1669:
---

Thanks for your advice, Jacky007.
We have solved it the same way. I see the problem still exists in the 
latest version; should we fix it in the next one?



[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

2013-03-15 Thread Jacky007 (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603262#comment-13603262
 ] 

Jacky007 commented on ZOOKEEPER-1669:
-

We have paid the price for this too. But the fix is simple: hash the connection 
by session id when the session is created, and look it up in the hash when 
closing it. :)
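
The idea of registering on creation and looking up on close can be sketched as a small index. This is an assumption-laden illustration rather than anyone's actual fix: `SessionIndex` and its method names are hypothetical, and the connection type is left generic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical index keyed by session id; all names are illustrative only.
class SessionIndex<C> {
    private final ConcurrentMap<Long, C> bySession = new ConcurrentHashMap<>();

    // Call when a session is bound to a connection.
    void onSessionCreated(long sessionId, C cnxn) {
        bySession.put(sessionId, cnxn);
    }

    // Call when the session closes; returns the connection to close,
    // or null if no connection is registered for that session.
    C onSessionClosed(long sessionId) {
        return bySession.remove(sessionId);
    }

    int size() {
        return bySession.size();
    }
}
```

Because the lookup is a single hash removal, the cost of closing a session no longer grows with the number of open connections.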
