[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096415#comment-16096415
 ] 

ASF GitHub Bot commented on ZOOKEEPER-1669:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/312
  
    @CheneySun Good summary! I've added it to the JIRA description. Please
    also update the description of this pull request to match (I can't modify
    your pull request :).


> Operations to server will be timed-out while thousands of sessions expired 
> same time
> ------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1669
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1669
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.3.5
>            Reporter: tokoot
>            Assignee: Cheney Sun
>              Labels: performance
>
> If there are thousands of clients and most of them disconnect from the 
> server at the same time (clients restarted, or servers partitioned from 
> clients), the server becomes busy closing those "connections" and becomes 
> unavailable. The problem is in the following code:
>   private void closeSessionWithoutWakeup(long sessionId) {
>       HashSet<NIOServerCnxn> cnxns;
>       synchronized (this.cnxns) {
>           // other threads block here while the whole set is cloned
>           cnxns = (HashSet<NIOServerCnxn>)this.cnxns.clone();
>       }
>       ...
>   }
> A real-world example that demonstrates this problem (kudos to [~sun.cheney]):
> {noformat}
> The issue arises when tens of thousands of clients try to reconnect to the 
> ZooKeeper service. We came across it while maintaining our HBase cluster, 
> which used a 5-server ZooKeeper cluster. The HBase cluster was composed of 
> regionservers numbering in the thousands and was accessed by tens of 
> thousands of clients doing massive reads/writes. Because the r/w throughput 
> was very high, the ZooKeeper zxid increased quickly as well. Roughly every 
> two or three weeks, ZooKeeper would hold a leader re-election triggered by 
> the zxid rollover. The leader re-election caused the clients (HBase 
> regionservers and HBase clients) to disconnect from and reconnect to the 
> ZooKeeper servers at the same time and try to renew their sessions.
> In the current implementation of session renewal, NIOServerCnxnFactory first 
> clones all the connections to avoid race conditions between threads, and 
> then iterates over the cloned connection set one by one to find the session 
> to renew. This is very time-consuming. In our case (described above), many 
> region servers could not renew their sessions before the session timeout; 
> eventually the HBase cluster lost these region servers, which affected 
> HBase stability.
> The change refactors the close-session logic and introduces a 
> ConcurrentHashMap that maps session ids to connections. It is a thread-safe 
> data structure and eliminates the need to clone the connection set first.
> {noformat}
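
The idea in the quoted description can be illustrated with a minimal sketch. 
This is not the code from the pull request: the Connection interface and the 
SessionConnectionIndex class below are hypothetical stand-ins invented for 
illustration, and the real NIOServerCnxn / NIOServerCnxnFactory classes may 
differ. It only shows how a ConcurrentHashMap keyed by session id lets a 
session be closed with a single lookup, without cloning or scanning the whole 
connection set under a lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for a server connection; not the real NIOServerCnxn API. */
interface Connection {
    long getSessionId();
    void close();
}

/** Hypothetical index from session id to connection (an assumption for this sketch). */
class SessionConnectionIndex {
    // Thread-safe map, so callers need no external lock and no defensive clone.
    private final ConcurrentMap<Long, Connection> bySession = new ConcurrentHashMap<>();

    /** Record a connection under the session id it reports. */
    void register(Connection cnxn) {
        bySession.put(cnxn.getSessionId(), cnxn);
    }

    /**
     * Close the connection for one session.
     * remove() is atomic, so at most one caller obtains and closes the connection.
     * Returns true if a connection was found for the session id.
     */
    boolean closeSession(long sessionId) {
        Connection cnxn = bySession.remove(sessionId);
        if (cnxn == null) {
            return false;
        }
        cnxn.close();
        return true;
    }

    /** Number of sessions currently indexed. */
    int size() {
        return bySession.size();
    }
}
```

With this layout, closing one session costs a single map lookup instead of a 
copy plus a scan of every connection, which is the cost the description 
identifies when thousands of sessions expire together.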



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
