Donal Evans created GEODE-9007:
----------------------------------

             Summary: Allow rebalancing of client subscription queues
                 Key: GEODE-9007
                 URL: https://issues.apache.org/jira/browse/GEODE-9007
             Project: Geode
          Issue Type: New Feature
          Components: client queues
            Reporter: Donal Evans


In clusters where membership changes have led to one server remaining alive 
while others are restarted, such as in a rolling restart, it is possible for 
almost all clients to have one server set as the primary host for their client 
subscription queue, leading to that server becoming overloaded. There is 
currently no mechanism for client subscription queues to be moved from an 
overloaded server to a less loaded server, or for primary queues to be 
automatically reassigned based on server load, meaning that if the cluster gets 
into a “bad” state there is no straightforward way to remedy the 
situation.

h4. Goal

Users should have a way to trigger a rebalance of all client subscription 
queues for currently connected clients in a cluster.

h4. Requirements
 * No queued events should be lost from client subscription queues during the 
rebalance process.
 * There should be no significant impact on performance during the rebalance 
process, both in terms of resource use in the cluster and continuous 
dispatching of events to clients.
 * The rebalance process should complete in a reasonable amount of time and not 
repeat steps.
 * Changes in cluster membership and client subscription should not impact the 
success of the rebalance process.
 * Once the rebalance is complete, the total number of primary queues hosted on 
each server should be as close as possible to the average number of primary 
queues per server. Depending on client configuration, it may not be possible to 
perfectly balance all servers, as certain clients may not have access to 
certain servers, restricting which “moves” are possible.
 * Once the rebalance is complete, the total number of queues hosted on each 
server should be as close as possible to the average number of queues per 
server. The caveat regarding the total number of primary queues per server also 
applies in this case.
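
The balance targets in the last two requirements can be made concrete with a small calculation. The following is only a sketch: the class `PrimaryQueueBalance` and its methods are hypothetical names and not part of the Geode API, and it assumes every client can reach every server, so the access restrictions mentioned above are ignored. Under that assumption, a cluster is as balanced as it can be when no server's count differs from the average by one or more.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Geode API. */
final class PrimaryQueueBalance {

    /** Average number of queues per server, or 0 if there are no servers. */
    static double average(Map<String, Integer> queuesPerServer) {
        if (queuesPerServer.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        for (int count : queuesPerServer.values()) {
            total += count;
        }
        return (double) total / queuesPerServer.size();
    }

    /** Largest absolute distance of any server's count from the average. */
    static double maxDeviation(Map<String, Integer> queuesPerServer) {
        double avg = average(queuesPerServer);
        double worst = 0.0;
        for (int count : queuesPerServer.values()) {
            worst = Math.max(worst, Math.abs(count - avg));
        }
        return worst;
    }

    /**
     * Assumption: with no access restrictions, the best achievable layout puts
     * either floor(avg) or ceil(avg) queues on every server, which is
     * equivalent to a maximum deviation strictly below 1.
     */
    static boolean isBalanced(Map<String, Integer> queuesPerServer) {
        return maxDeviation(queuesPerServer) < 1.0;
    }
}
```

For example, a map of server name to primary queue count such as `{s1=9, s2=1, s3=2}` has an average of 4 and a maximum deviation of 5, so it is not balanced, whereas `{s1=4, s2=3, s3=3}` is. The same check applies to the total number of queues per server.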

h4. Current Behavior
Several aspects of existing behavior are relevant to the proposed 
implementation for rebalancing client subscription queues. The most relevant 
are listed below:
* The client automatically restores (or attempts to restore) redundancy when it 
detects that the number of redundant queues is less than the configured 
redundancy. If a server shuts down or a CacheClientProxy or 
CacheClientUpdater is closed, this happens immediately, but if the server is 
disconnected, it can take up to the configured ping-interval 
(default value 10 seconds) for the client to begin restoring redundancy.
* The client contacts a locator for server/load information and then decides 
where best to create connections based on the size and primary/secondary 
status of any existing queues, choosing randomly if there are none, with no 
consideration given to server load due to other clients.
* Extra redundant copies are not removed, although it should not normally be 
possible for actual redundancy to exceed the configured redundancy. One possible 
exception to this is the case of durable client queues, where it is 
conceptually possible for a client to lose connection to a server hosting a 
durable queue, create a new durable queue on a different server to restore 
redundancy, then recover the connection to the first server before the 
configured durable-client-timeout has elapsed.
* Primary queues are not relocated during automatic redundancy restoration. If 
a server is hosting the primary queue for a client, that server will remain the 
primary until the queue is closed or the server disconnects or stops.
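
The restoration rules listed above can be illustrated with a toy model of a single client's queues. This is a sketch under stated assumptions, not Geode code: `ClientQueues` and its methods are hypothetical names, and promoting the first remaining secondary when the primary's server is lost is an assumption made for the example rather than something stated in this issue. The model shows that redundancy is only ever topped up, never trimmed, and that restoring redundancy does not move the primary.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one client's subscription queues; hypothetical, not Geode code. */
final class ClientQueues {
    private final int configuredRedundancy;
    private String primary; // server hosting the primary queue, or null if none
    private final List<String> secondaries = new ArrayList<>();

    ClientQueues(int configuredRedundancy, String primary, List<String> secondaries) {
        this.configuredRedundancy = configuredRedundancy;
        this.primary = primary;
        this.secondaries.addAll(secondaries);
    }

    /** Redundant queues still to be created. Never negative: extras are not removed. */
    int missingRedundancy() {
        return Math.max(0, configuredRedundancy - secondaries.size());
    }

    /** A server hosting one of this client's queues shut down or disconnected. */
    void onServerLost(String server) {
        secondaries.remove(server);
        if (server.equals(primary)) {
            // Assumption for this sketch: the first remaining secondary is promoted.
            primary = secondaries.isEmpty() ? null : secondaries.remove(0);
        }
    }

    /** Creates secondaries on candidate servers until redundancy is met; the primary never moves. */
    void restoreRedundancy(List<String> candidateServers) {
        for (String server : candidateServers) {
            if (missingRedundancy() == 0) {
                break;
            }
            if (!server.equals(primary) && !secondaries.contains(server)) {
                secondaries.add(server);
            }
        }
    }

    String primary() {
        return primary;
    }

    List<String> secondaries() {
        return List.copyOf(secondaries);
    }
}
```

In this model, a client with its primary on `s1` and a secondary on `s2` that loses `s2` reports a missing redundancy of 1, and restoring it from the candidates `s1, s3, s4` adds a secondary on `s3` while the primary stays on `s1`. A rebalance operation would need an extra step that this model deliberately lacks: moving the primary itself.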




