[
https://issues.apache.org/jira/browse/HDDS-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974366#comment-16974366
]
Stephen O'Donnell commented on HDDS-2459:
-----------------------------------------
In the decommission design doc, we had an algorithm to determine the number of
replicas that need to be created or destroyed so a container can be perfectly
replicated. The algorithm was:
{code}
/**
* Calculate the number of the missing replicas.
*
* @return the number of the missing replicas. If it's less than zero, the
* container is over replicated.
*/
int getReplicationCount(int expectedCount, int healthy,
    int maintenance, int inFlight) {
  // For over replication, count only the healthy replicas.
  if (expectedCount < healthy) {
    return expectedCount - healthy;
  }
  int replicaCount = expectedCount - (healthy + maintenance + inFlight);
  if (replicaCount == 0 && healthy < 1) {
    replicaCount++;
  }
  // Over replication is already handled above.
  return Math.max(0, replicaCount);
}
{code}
Reflecting on this for some time, I think it is a little too simplistic and
would propose the following instead. One key difference in the logic below is
that maintenance replicas are not considered when calculating over-replication.
This is because a maintenance copy cannot be removed (the node is offline) and
there is a not insignificant chance the node will fail to come back online,
resulting in all its replicas getting lost.
{code}
/**
* Calculates the delta of replicas which need to be created or removed
* to ensure the container is correctly replicated.
*
* Decisions around over-replication are made only on healthy replicas,
* ignoring any in maintenance and also any inflight adds. InFlight adds are
* ignored, as they may not complete, so if we have:
*
* H, H, H, IN_FLIGHT_ADD
*
* And then schedule a delete, we could end up under-replicated (add fails,
* delete completes). It is better to let the inflight operations complete
* and then deal with any further over or under replication.
*
* For maintenance replicas, assuming replication factor 3, and minHealthy
* 2, it is possible for all 3 hosts to be put into maintenance, leaving the
* following (H = healthy, M = maintenance):
*
* H, H, M, M, M
*
* Even though we are tracking 5 replicas, this is not over replicated as we
* ignore the maintenance copies. Later, the replicas could look like:
*
* H, H, H, H, M
*
* At this stage, the container is over replicated by 1, so one replica can be
* removed.
*
* For containers which have exactly replicationFactor healthy replicas, we
* ignore any inflight adds or deletes, as they may fail. Instead, wait for
* them to complete and then deal with any excess or deficit.
*
* For under replicated containers we do consider inflight adds and deletes to
* avoid scheduling more adds than needed. There is additional logic around
* containers with maintenance replicas to ensure minHealthyForMaintenance
* replicas are maintained.
*
* @return Delta of replicas needed. Negative indicates over replication and
* replicas should be removed. Positive indicates under replication
* and zero indicates the container has replicationFactor healthy
* replicas.
*/
public int additionalReplicaNeeded() {
  int blockDelta = 0;
  int delta = repFactor - healthyCount;
  if (delta < 0) {
    // Over replicated, so may need to remove a block. Do not consider
    // inFlightAdds, as they may fail, but do consider inFlightDel which
    // will reduce the over-replication if it completes.
    blockDelta = delta + inFlightDel;
  } else if (delta > 0) {
    // May be under-replicated, depending on maintenance. When a container is
    // under-replicated, we must consider inflight add and delete when
    // calculating the new containers needed.
    if (maintenanceCount != 0) {
      // Remove maintenance copies from delta to see if it is really
      // under-replicated.
      delta = Math.max(0, delta - maintenanceCount);
      // Check we have enough healthy replicas
      int neededHealthy =
          Math.max(0, minHealthyForMaintenance - healthyCount);
      delta = Math.max(neededHealthy, delta);
    }
    blockDelta = delta - inFlightAdd + inFlightDel;
  } else { // delta == 0
    // We have exactly the number of healthy replicas needed, but there may
    // be inflight add or delete. Ignore them until they complete or fail
    // and then deal with the excess or deficit.
    blockDelta = delta;
  }
  return blockDelta;
}
{code}
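To make the scenarios from the comment above concrete, here is a minimal
self-contained sketch that wraps the proposed calculation in a small class and
runs it against a few replica layouts. The class name ReplicaDeltaSketch and
its constructor are hypothetical and exist only for illustration; the field
names are taken from the snippet above, and this is not the actual
ReplicationManager code.

```java
/** Hypothetical holder for the counts used by the proposed calculation. */
final class ReplicaDeltaSketch {
  private final int repFactor;
  private final int minHealthyForMaintenance;
  private final int healthyCount;
  private final int maintenanceCount;
  private final int inFlightAdd;
  private final int inFlightDel;

  ReplicaDeltaSketch(int repFactor, int minHealthyForMaintenance,
      int healthyCount, int maintenanceCount, int inFlightAdd,
      int inFlightDel) {
    this.repFactor = repFactor;
    this.minHealthyForMaintenance = minHealthyForMaintenance;
    this.healthyCount = healthyCount;
    this.maintenanceCount = maintenanceCount;
    this.inFlightAdd = inFlightAdd;
    this.inFlightDel = inFlightDel;
  }

  /** Same logic as the proposal: negative = over, positive = under. */
  int additionalReplicaNeeded() {
    int delta = repFactor - healthyCount;
    if (delta < 0) {
      // Over replicated: ignore inflight adds, count inflight deletes.
      return delta + inFlightDel;
    }
    if (delta == 0) {
      // Exactly repFactor healthy replicas: ignore inflight operations.
      return 0;
    }
    if (maintenanceCount != 0) {
      // Maintenance copies offset the deficit, but never below the
      // minimum number of healthy replicas required during maintenance.
      delta = Math.max(0, delta - maintenanceCount);
      int neededHealthy =
          Math.max(0, minHealthyForMaintenance - healthyCount);
      delta = Math.max(neededHealthy, delta);
    }
    return delta - inFlightAdd + inFlightDel;
  }

  public static void main(String[] args) {
    // Replication factor 3 and minHealthyForMaintenance 2 throughout.
    // H, H, M, M, M: not over replicated, nothing to do.
    System.out.println(
        new ReplicaDeltaSketch(3, 2, 2, 3, 0, 0).additionalReplicaNeeded());
    // H, H, H, H, M: over replicated by one.
    System.out.println(
        new ReplicaDeltaSketch(3, 2, 4, 1, 0, 0).additionalReplicaNeeded());
    // H, M, M: one more healthy copy is needed to reach minHealthy.
    System.out.println(
        new ReplicaDeltaSketch(3, 2, 1, 2, 0, 0).additionalReplicaNeeded());
    // H, H, H, IN_FLIGHT_ADD: the inflight add is ignored.
    System.out.println(
        new ReplicaDeltaSketch(3, 2, 3, 0, 1, 0).additionalReplicaNeeded());
  }
}
```

Under these assumptions the four cases print 0, -1, 1 and 0. The third case
is where this differs from the design doc algorithm, which would return 0 for
H, M, M because it has no notion of a minimum healthy count.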
The following logic also describes the conditions the replicas of a container
must meet to be considered sufficiently replicated - note that inflight adds
are ignored and inflight deletes are considered until they complete:
{code}
/**
* Return true if the container is sufficiently replicated. Decommissioning
* and Decommissioned containers are ignored in this check, assuming they will
* eventually be removed from the cluster.
* This check ignores inflight additions, as those replicas have not yet been
* created and the create could fail for some reason.
* The check does consider inflight deletes as there may be 3 healthy replicas
* now, but once the delete completes it will reduce to 2.
* We also assume a replica in Maintenance state cannot be removed, so the
* pending delete would affect only the healthy replica count.
*
* @return True if the container is sufficiently replicated and False
* otherwise.
*/
public boolean isSufficientlyReplicated() {
return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
&& healthyCount - inFlightDel >= minHealthyForMaintenance;
}
{code}
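As a quick illustration of the check above, the sketch below evaluates the
same expression for a few replica layouts. The class name
SufficientReplicationSketch and the static method signature are hypothetical,
used only so the example is self-contained; the parameter names mirror the
fields in the snippet above.

```java
/** Hypothetical standalone version of the sufficiency check. */
final class SufficientReplicationSketch {

  /**
   * Inflight adds are ignored; inflight deletes are subtracted from the
   * healthy replicas, since maintenance replicas cannot be removed.
   */
  static boolean isSufficientlyReplicated(int repFactor,
      int minHealthyForMaintenance, int healthyCount, int maintenanceCount,
      int inFlightDel) {
    return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
        && healthyCount - inFlightDel >= minHealthyForMaintenance;
  }

  public static void main(String[] args) {
    // Replication factor 3 and minHealthyForMaintenance 2 throughout.
    // H, H, M, M, M: enough copies overall and enough healthy ones.
    System.out.println(isSufficientlyReplicated(3, 2, 2, 3, 0));
    // H, H, H with one pending delete: will drop to 2 healthy copies.
    System.out.println(isSufficientlyReplicated(3, 2, 3, 0, 1));
    // H, M, M: three copies tracked, but only one is healthy.
    System.out.println(isSufficientlyReplicated(3, 2, 1, 2, 0));
  }
}
```

With these inputs the three cases print true, false and false. The second
case shows why the pending delete matters: the container looks fully
replicated now but will not be once the delete completes.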
> Refactor ReplicationManager to consider maintenance states
> ----------------------------------------------------------
>
> Key: HDDS-2459
> URL: https://issues.apache.org/jira/browse/HDDS-2459
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Components: SCM
> Affects Versions: 0.5.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
>
> In its current form the replication manager does not consider decommission or
> maintenance states when checking if replicas are sufficiently replicated.
> With the introduction of maintenance states, it needs to consider
> decommission and maintenance states when deciding if blocks are over or under
> replicated.
> It also needs to provide an API to allow the decommission manager to check if
> blocks are over or under replicated, so the decommission manager can decide
> if a node has completed decommission and maintenance or not.