[ https://issues.apache.org/jira/browse/GEODE-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fred Krone updated GEODE-4250:
------------------------------
    Description: 
Acceptance criteria:

-- There is a way for a user to detect that redundancy is restored

-- There is a way to check current redundancy

-- Can set moveBuckets and movePrimary to false and run rebalance
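
Two of the criteria above (detecting that redundancy is restored, and checking
current redundancy) can already be approximated programmatically via Geode's
`PartitionRegionHelper` API. A minimal sketch, assuming a running cache and an
illustrative region name ("exampleRegion" is not from this ticket):

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.partition.PartitionRegionHelper;
import org.apache.geode.cache.partition.PartitionRegionInfo;

public class RedundancyCheck {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();
    // Illustrative name; use any partitioned region with redundancy configured
    Region<?, ?> region = cache.getRegion("exampleRegion");
    PartitionRegionInfo info = PartitionRegionHelper.getPartitionRegionInfo(region);
    System.out.println("Configured redundant copies: " + info.getConfiguredRedundantCopies());
    System.out.println("Actual redundant copies:     " + info.getActualRedundantCopies());
    // A bucket is "low redundancy" when it has fewer copies than configured
    boolean redundancyRestored = info.getLowRedundancyBucketCount() == 0;
    System.out.println("Redundancy restored: " + redundancyRestored);
    cache.close();
  }
}
```

This only covers the "check" side; the ticket still asks for a first-class
command and API for the "restore" side.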

 

The command would succeed only when the system is fully redundant.

Re-establishing redundancy after the loss of a peer node is typically far more 
urgent and important than achieving better balance. The operational impact of 
rebalancing is also much higher: it forces impacted buckets' updates to be 
distributed to _redundancy-copies + 1_ peer processes and can spike 
p2p connections/threads (and thus load) far beyond normal operations. If the 
system is already close to exhausting the available capacity of some hardware 
component, this can be enough to push it over the edge (and may cause the 
original fault to recur). This problem is exacerbated when the cluster's 
overall capacity has been reduced by the loss of a physical server. Without 
the ability to separate the operational tasks of re-establishing full data 
redundancy and rebalancing bucket partitions (that are already safely 
redundant), system administrators may be forced to provision replacement 
capacity _before_ they can restore full service, increasing downtime 
unnecessarily.

For these reasons, we must add the option to execute these operational tasks 
separately.

It still makes sense for _rebalancing_ ops to first re-establish redundancy, so 
we can keep the existing GFSH command/behavior (it would still be useful to 
clearly log completion of one step before the next one begins). We need a new 
GFSH command/ResourceManager API to execute re-establishment of redundancy 
_without_ rebalancing.
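
For context, this is the existing `ResourceManager` rebalance entry point that
a redundancy-only operation would sit alongside. A sketch, assuming a running
cache; the accessors shown are existing `RebalanceResults` methods, but the
split into a redundancy-only step is exactly what this ticket proposes and
does not exist yet:

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.control.RebalanceOperation;
import org.apache.geode.cache.control.RebalanceResults;
import org.apache.geode.cache.control.ResourceManager;

public class RebalanceExample {
  public static void main(String[] args) throws InterruptedException {
    Cache cache = new CacheFactory().create();
    ResourceManager manager = cache.getResourceManager();
    // Today this single call both restores redundancy and moves
    // buckets/primaries; the ticket asks for a way to run only the
    // redundancy-recovery step.
    RebalanceOperation op = manager.createRebalanceFactory().start();
    RebalanceResults results = op.getResults(); // blocks until completion
    System.out.println("Bucket creates (redundancy recovery): "
        + results.getTotalBucketCreatesCompleted());
    System.out.println("Bucket transfers (balance): "
        + results.getTotalBucketTransfersCompleted());
    System.out.println("Primary transfers: "
        + results.getTotalPrimaryTransfersCompleted());
    cache.close();
  }
}
```

The distinct results accessors suggest the operation already tracks the
redundancy-recovery step separately from bucket/primary movement, which is
what makes exposing it on its own feasible.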



> Users would like a command to re-establish redundancy without rebalancing
> -------------------------------------------------------------------------
>
>                 Key: GEODE-4250
>                 URL: https://issues.apache.org/jira/browse/GEODE-4250
>             Project: Geode
>          Issue Type: Improvement
>          Components: docs, regions
>            Reporter: Fred Krone
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
