[
https://issues.apache.org/jira/browse/GEODE-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gideon updated GEODE-4250:
--------------------------
Description:
Command would only succeed when the system is fully redundant.
Re-establishing redundancy after the loss of a peer node is typically far more
urgent and important than achieving better balance. The operational impact of
rebalancing is also much higher, forcing impacted buckets' updates to be
distributed to _redundancy-copies + 1_ peer processes and potentially spiking
p2p connections/threads (and thus load) far beyond normal operations. If the
system is already close to exhausting available capacity for some hardware
component, this can be enough to push it over the edge (and may force the
original fault to recur). This problem is exacerbated when the cluster's
overall capacity has been reduced due to the loss of a physical server.
Without the ability to separate the operational tasks of re-establishing full
data redundancy and rebalancing bucket partitions (that are already safely
redundant), system administrators may be forced to provision replacement
capacity _before_ they can restore full service, thus increasing downtime
unnecessarily.
For these reasons, we must add the option to execute these operations
separately.
It still makes sense for _rebalancing_ ops to first re-establish redundancy, so
we can keep the existing GFSH command/behavior (it would still be useful to
clearly log completion of one step before the next one begins). We need a new
GFSH command/ResourceManager API to execute re-establishment of redundancy
_without_ rebalancing.
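To illustrate the intended separation, here is a minimal sketch of a planner
that only adds the missing copies and never moves a copy that already exists.
It is not Geode code: the class RedundancyPlanner, the method planRestore, and
the "least-loaded member first" placement rule are all hypothetical, chosen
only to show what "redundancy without rebalancing" could mean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of the Geode ResourceManager API. */
public class RedundancyPlanner {

    /**
     * Plans only the copies needed to restore redundancy. Existing copies stay put.
     *
     * @param hosts           bucket id -> live members currently hosting a copy
     * @param members         all live members
     * @param redundantCopies configured redundant copies; each bucket wants
     *                        redundantCopies + 1 hosts in total
     * @return bucket id -> members that should receive a new copy
     */
    public static Map<Integer, List<String>> planRestore(
            Map<Integer, Set<String>> hosts, List<String> members, int redundantCopies) {
        // Load per member = number of bucket copies it already hosts.
        Map<String, Integer> load = new HashMap<>();
        for (String m : members) {
            load.put(m, 0);
        }
        for (Set<String> hs : hosts.values()) {
            for (String m : hs) {
                load.merge(m, 1, Integer::sum);
            }
        }

        Map<Integer, List<String>> plan = new TreeMap<>();
        // Iterate buckets in id order so the plan is deterministic.
        for (Map.Entry<Integer, Set<String>> e : new TreeMap<>(hosts).entrySet()) {
            Set<String> current = new HashSet<>(e.getValue());
            List<String> added = new ArrayList<>();
            while (current.size() < redundantCopies + 1) {
                // Assumed placement rule: least-loaded member not yet hosting this bucket.
                String best = null;
                for (String m : members) {
                    if (current.contains(m)) {
                        continue;
                    }
                    if (best == null || load.get(m) < load.get(best)) {
                        best = m;
                    }
                }
                if (best == null) {
                    break; // too few members to reach full redundancy
                }
                current.add(best);
                added.add(best);
                load.merge(best, 1, Integer::sum);
            }
            if (!added.isEmpty()) {
                plan.put(e.getKey(), added);
            }
        }
        return plan;
    }
}
```

Buckets that are already fully redundant never appear in the returned plan, so
the only extra load is the copying needed to restore redundancy.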
was:Command would only succeed when the system is fully redundant.
> Users would like a command to wait for redundancy to be satisfied before
> rebalance
> ----------------------------------------------------------------------------------
>
> Key: GEODE-4250
> URL: https://issues.apache.org/jira/browse/GEODE-4250
> Project: Geode
> Issue Type: Improvement
> Components: regions
> Reporter: Fred Krone
> Priority: Major
>