Donal Evans created GEODE-9007:
----------------------------------
Summary: Allow rebalancing of client subscription queues
Key: GEODE-9007
URL: https://issues.apache.org/jira/browse/GEODE-9007
Project: Geode
Issue Type: New Feature
Components: client queues
Reporter: Donal Evans
In clusters where membership changes have led to one server remaining alive
while others are restarted, such as in a rolling restart, it is possible for
almost all clients to have one server set as the primary host for their client
subscription queue, leading to that server becoming overloaded. There is
currently no mechanism for client subscription queues to be moved from an
overloaded server to a less loaded server, or for primary queues to be
automatically reassigned based on server load, meaning that if the cluster gets
into a “bad” state, there is no straightforward way to remedy the
situation.
h4. Goal
Users should have a way to trigger a rebalance of all client subscription
queues for currently connected clients in a cluster.
h4. Requirements
* No queued events should be lost from client subscription queues during the
rebalance process.
* There should be no significant impact on performance during the rebalance
process, both in terms of resource use in the cluster and continuous
dispatching of events to clients.
* The rebalance process should complete in a reasonable amount of time and not
repeat steps.
* Changes in cluster membership and client subscription should not impact the
success of the rebalance process.
* Once the rebalance is complete, the total number of primary queues hosted on
each server should be as close as possible to the average number of primary
queues per server. Depending on client configuration, it may not be possible to
perfectly balance all servers, as certain clients may not have access to
certain servers, restricting which “moves” are possible.
* Once the rebalance is complete, the total number of queues hosted on each
server should be as close as possible to the average number of queues per
server. The caveat regarding the total number of primary queues per server also
applies in this case.
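The balance target in the last two requirements can be illustrated with a minimal sketch. It is not a proposed implementation: the class, the {{Move}} type, the {{planMoves}} method and the input maps are all hypothetical names, and the greedy strategy is only one possible approach. It repeatedly moves a primary queue from a more loaded server to a less loaded server that the owning client can reach, stopping when no single move reduces the imbalance. Because it only considers single moves, it may stop short of the best achievable balance when client access is restricted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; none of these names exist in Geode.
final class PrimaryQueueBalanceSketch {

  /** One proposed relocation of a client's primary queue. */
  static final class Move {
    final String client;
    final String from;
    final String to;

    Move(String client, String from, String to) {
      this.client = client;
      this.from = from;
      this.to = to;
    }
  }

  /**
   * Greedily plans moves of primary queues towards the per-server average.
   *
   * @param servers all servers that may host queues
   * @param primaryByClient client id -> server currently hosting its primary queue
   * @param reachableByClient client id -> servers that client is able to use
   */
  static List<Move> planMoves(Set<String> servers,
                              Map<String, String> primaryByClient,
                              Map<String, Set<String>> reachableByClient) {
    // Count primaries per server; sorted maps keep iteration deterministic.
    Map<String, Integer> load = new TreeMap<>();
    for (String server : servers) {
      load.put(server, 0);
    }
    Map<String, String> primary = new TreeMap<>(primaryByClient);
    for (String server : primary.values()) {
      load.merge(server, 1, Integer::sum);
    }

    List<Move> moves = new ArrayList<>();
    while (true) {
      Move best = null;
      int bestGap = 1; // a move only helps if source exceeds target by 2 or more
      for (Map.Entry<String, String> entry : primary.entrySet()) {
        String from = entry.getValue();
        Set<String> reachable =
            reachableByClient.getOrDefault(entry.getKey(), Collections.emptySet());
        for (String to : reachable) {
          if (!load.containsKey(to)) {
            continue; // ignore servers that are not part of the cluster view
          }
          int gap = load.get(from) - load.get(to);
          if (gap > bestGap) {
            bestGap = gap;
            best = new Move(entry.getKey(), from, to);
          }
        }
      }
      if (best == null) {
        return moves; // no single move reduces the imbalance any further
      }
      load.merge(best.from, -1, Integer::sum);
      load.merge(best.to, 1, Integer::sum);
      primary.put(best.client, best.to);
      moves.add(best);
    }
  }
}
```

With three servers and six clients whose primaries all sit on one server, and every client able to reach every server, this sketch plans four moves and ends with two primaries per server. A client that can only reach its current server is never moved, which matches the caveat about restricted “moves” above. The same counting could be applied to the total number of queues per server rather than primaries only.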
h4. Current Behavior
Some aspects of existing behavior are relevant or useful to the proposed
implementation for rebalancing client subscription queues. The behaviors
identified as particularly relevant are listed below:
* The client automatically restores (or attempts to restore) redundancy when it
detects that the number of redundant queues is less than the configured
redundancy. If a server shuts down, or a CacheClientProxy or
CacheClientUpdater is closed, this happens immediately. If the server is
instead disconnected, it can take up to the configured ping-interval
(default value 10 seconds) for the client to begin restoring redundancy.
* The client contacts a locator for server/load information and then decides
where best to create connections based on the size and primary/secondary
status of its existing queues, if any, and randomly otherwise, with no
consideration given to server load due to other clients.
* Extra redundant copies are not removed, although it should not normally be
possible for actual redundancy to be greater than the configured redundancy. One possible
exception to this is the case of durable client queues, where it is
conceptually possible for a client to lose connection to a server hosting a
durable queue, create a new durable queue on a different server to restore
redundancy, then recover the connection to the first server before the
configured durable-client-timeout has elapsed.
* Primary queues are not relocated during automatic redundancy restoration. If
a server is hosting the primary queue for a client, that server will remain the
primary until the queue is closed or the server disconnects or stops.