[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926786#comment-17926786
 ] 

Chevaris commented on ZOOKEEPER-571:
------------------------------------

Rebalancing the load is a very challenging task, depending on how you want to 
measure the load (CPU, memory, IO...).

Maybe a good starting point is to balance the number of clients connected to 
each server of the cluster (related to ZOOKEEPER-2748). To achieve that goal, 
the first step in my view is to provide a simple way to access that 
information (how many clients each server has).

If that is provided in a ZNode that holds all the aggregated info, clients 
could easily watch it and potentially take action. To take action, an API is 
needed that allows the client to disconnect and reconnect to another server 
(pretty much what the updateServerList API is doing).

Servers need to:
 * Publish the number of connections per server in a ZNode at a regular 
interval (e.g. every 60 secs)
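
As a rough sketch of what the published data could look like, assuming a JSON 
payload and a made-up ZNode path (neither exists in ZooKeeper today; the path, 
field names and helper functions below are hypothetical):

```python
import json
import time

# Hypothetical path; ZooKeeper does not define such a ZNode today.
CONN_STATS_PATH = "/zookeeper/connections"
PUBLISH_INTERVAL_SECS = 60  # the example frequency from this comment


def encode_conn_stats(conns_per_server, now=None):
    """Serialize per-server connection counts into bytes for the ZNode."""
    payload = {
        "timestamp": int(time.time() if now is None else now),
        "connections": dict(conns_per_server),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode_conn_stats(data):
    """Parse ZNode bytes back into a {server: connection_count} dict."""
    return json.loads(data.decode("utf-8"))["connections"]
```

Keeping a timestamp in the payload would let clients ignore stale data if 
publishing stops for some reason.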

Clients need to:
 * Watch that ZNode and, on every change, decide whether it is worthwhile to 
disconnect and connect to another server, based on a function that assesses 
with a given probability, for each client, whether reconnecting is worthwhile 
(globally the goal is to balance connections, so not all clients connected to 
the most loaded server need to disconnect). The function of course needs to be 
discussed.
 * If disconnection is needed, wait a random amount of time before 
disconnecting, to avoid a massive number of concurrent connection 
establishments.
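
One possible shape for the client-side function, only as a starting point for 
the discussion: each client on an overloaded server moves with probability 
(excess over the cluster average) / (current count), so that on average only 
the excess leaves. The function names, the weighting of targets and the 
30-second maximum delay are all assumptions of this sketch, not an agreed 
design.

```python
import random


def reconnect_probability(conns, my_server):
    """Probability that a client on my_server should move elsewhere.

    Only the excess above the cluster average needs to leave, so each
    client moves with probability excess / current_count.
    """
    mine = conns.get(my_server, 0)
    if not conns or mine == 0:
        return 0.0
    average = sum(conns.values()) / len(conns)
    return max(0.0, (mine - average) / mine)


def pick_target(conns, rng=random):
    """Pick an under-loaded server, weighted by how far below average it is."""
    average = sum(conns.values()) / len(conns)
    deficits = {s: average - c for s, c in conns.items() if c < average}
    if not deficits:
        return None
    servers = list(deficits)
    return rng.choices(servers, weights=[deficits[s] for s in servers])[0]


def plan_reconnect(conns, my_server, max_delay_secs=30.0, rng=random):
    """Return (target, delay_secs) if this client should move, else None."""
    if rng.random() >= reconnect_probability(conns, my_server):
        return None
    target = pick_target(conns, rng)
    if target is None:
        return None
    # Random delay so that clients do not all reconnect at the same moment.
    return target, rng.uniform(0.0, max_delay_secs)
```

For example, with 300 connections on one server and none on the other two, 
each client on the loaded server would move with probability 2/3. Some 
hysteresis (e.g. ignoring small imbalances) would probably be needed on top of 
this to avoid ping-ponging sessions.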

> support balancing of client load across servers in an ensemble
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-571
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-571
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: quorum, server
>            Reporter: Patrick D. Hunt
>            Priority: Major
>
> Currently the ensemble does not ensure a balanced load across servers in an 
> ensemble. Clients randomly connect to
> a server, which typically balances the number of sessions. However there are 
> problems with this:
> 1) session count is balanced, but not session load
> 2) if server A goes down all of the sessions on that server migrate to other 
> servers in the cluster randomly, this is fine, however
> when server A comes back into service it will have no sessions, and migration 
> of sessions from other servers may take time
> The quorum should probably have some way of broadcasting load, and 
> occasionally re-balance the sessions based on
> this information. Might be tricky though, want to ensure that we aren't 
> constantly ping-ponging sessions to servers.
> Probably need some hysteresis as well as limit the frequency. Real time 
> tuning would need to be supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
