[ 
https://issues.apache.org/jira/browse/HDDS-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974366#comment-16974366
 ] 

Stephen O'Donnell commented on HDDS-2459:
-----------------------------------------

In the decommission design doc, we had an algorithm to determine the number of 
replicas that need to be created or destroyed so a container can be perfectly 
replicated. The algorithm was:

{code}
/**
 * Calculate the number of the missing replicas.
 * 
 * @return the number of the missing replicas. If it's less than zero, the
 *         container is over replicated.
 */
int getReplicationCount(int expectedCount, int healthy, 
   int maintenance, int inFlight) {

   //for over replication, count only with the healthy replicas
   if (expectedCount < healthy) {
      return expectedCount - healthy;
   }
   
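   int replicaCount = expectedCount - (healthy + maintenance + inFlight);

   // always keep at least one healthy copy, even if maintenance or
   // inflight copies appear to cover the expected count
   if (replicaCount == 0 && healthy < 1) {
      replicaCount++;
   }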
   
   //over replication is already handled
   return Math.max(0, replicaCount);
}
{code}
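
To make the behaviour of that original calculation concrete, here is a small standalone sketch that runs it against a few replica mixes. The class name DesignDocCountSketch and the chosen inputs are only illustrative assumptions (replication factor 3 throughout); the method body follows the design doc snippet above.

```java
// Hypothetical wrapper class, used only to make the design doc snippet runnable.
class DesignDocCountSketch {

  static int getReplicationCount(int expectedCount, int healthy,
      int maintenance, int inFlight) {
    // For over replication, count only the healthy replicas.
    if (expectedCount < healthy) {
      return expectedCount - healthy;
    }
    int replicaCount = expectedCount - (healthy + maintenance + inFlight);
    // Always ask for one more copy if there is no healthy replica at all.
    if (replicaCount == 0 && healthy < 1) {
      replicaCount++;
    }
    // Over replication is already handled above.
    return Math.max(0, replicaCount);
  }

  public static void main(String[] args) {
    System.out.println(getReplicationCount(3, 4, 0, 0)); // H,H,H,H -> -1
    System.out.println(getReplicationCount(3, 2, 1, 0)); // H,H,M   -> 0
    System.out.println(getReplicationCount(3, 1, 2, 0)); // H,M,M   -> 0
    System.out.println(getReplicationCount(3, 0, 3, 0)); // M,M,M   -> 1
  }
}
```

Note the H,M,M case: with a single healthy copy and two copies offline for maintenance, this version asks for no new replicas.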

Having reflected on this for some time, I think it is a little too simplistic and 
would propose the following instead. One key difference in the logic below is 
that maintenance replicas are not considered when calculating over-replication. 
This is because a maintenance copy cannot be removed (the node is offline) and 
there is a not insignificant chance the node will fail to come back online, 
resulting in all its replicas getting lost.

{code}
  /**
   * Calculates the delta of replicas which need to be created or removed
   * to ensure the container is correctly replicated.
   *
   * Decisions around over-replication are made only on healthy replicas,
   * ignoring any in maintenance and also any inflight adds. InFlight adds are
   * ignored, as they may not complete, so if we have:
   *
   *     H, H, H, IN_FLIGHT_ADD
   *
   * And then schedule a delete, we could end up under-replicated (add fails,
   * delete completes). It is better to let the inflight operations complete
   * and then deal with any further over or under replication.
   *
   * For maintenance replicas, assuming replication factor 3, and minHealthy
   * 2, it is possible for all 3 hosts to be put into maintenance, leaving the
   * following (H = healthy, M = maintenance):
   *
   *     H, H, M, M, M
   *
   * Even though we are tracking 5 replicas, this is not over replicated as we
   * ignore the maintenance copies. Later, the replicas could look like:
   *
   *     H, H, H, H, M
   *
   * At this stage, the container is over replicated by 1, so one replica can be
   * removed.
   *
   * For containers which already have replicationFactor healthy replicas, we
   * ignore any inflight adds or deletes, as they may fail. Instead, wait for
   * them to complete and then deal with any excess or deficit.
   *
   * For under replicated containers we do consider inflight adds and deletes
   * to avoid scheduling more adds than needed. There is additional logic
   * around containers with maintenance replicas to ensure
   * minHealthyForMaintenance replicas are maintained.
   *
   * @return Delta of replicas needed. Negative indicates over replication and
   *         replicas should be removed. Positive indicates under replication
   *         and zero indicates the container has replicationFactor healthy
   *         replicas.
   */
  public int additionalReplicaNeeded() {
    int blockDelta = 0;
    int delta = repFactor - healthyCount;

    if (delta < 0) {
      // Over replicated, so may need to remove a block. Do not consider
      // inFlightAdds, as they may fail, but do consider inFlightDel which
      // will reduce the over-replication if it completes.
      blockDelta = delta + inFlightDel;
    } else if (delta > 0) {
      // May be under-replicated, depending on maintenance. When a container is
      // under-replicated, we must consider inflight add and delete when
      // calculating the new containers needed.
      if (maintenanceCount != 0) {
        // Remove maintenance copies from delta to see if it is really
        // under-replicated.
        delta = Math.max(0, delta - maintenanceCount);
        // Check we have enough healthy replicas
        int neededHealthy =
            Math.max(0, minHealthyForMaintenance - healthyCount);
        delta = Math.max(neededHealthy, delta);
      }
      blockDelta = delta - inFlightAdd + inFlightDel;
    } else { // delta == 0
      // We have exactly the number of healthy replicas needed, but there may
      // be inflight add or delete. Ignore them until they complete or fail
      // and then deal with the excess or deficit.
      blockDelta = delta;
    }
    return blockDelta;
  }
{code}
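
As a rough check of the proposed calculation, the sketch below restates it as a static method and runs it on the scenarios from the Javadoc plus a few others. The class name ReplicaDeltaSketch and the use of parameters instead of fields are assumptions made only so the example is self-contained; all cases assume replication factor 3 and minHealthyForMaintenance 2.

```java
// Hypothetical standalone sketch; in the proposal these values are fields of
// the object holding a container's replica counts, not method parameters.
class ReplicaDeltaSketch {

  static int additionalReplicaNeeded(int repFactor,
      int minHealthyForMaintenance, int healthyCount, int maintenanceCount,
      int inFlightAdd, int inFlightDel) {
    int delta = repFactor - healthyCount;
    if (delta < 0) {
      // Over replicated: ignore inflight adds, count inflight deletes.
      return delta + inFlightDel;
    }
    if (delta == 0) {
      // Exactly repFactor healthy replicas: ignore inflight operations.
      return 0;
    }
    if (maintenanceCount != 0) {
      // Maintenance copies offset the deficit, but a minimum number of
      // healthy replicas must still be kept.
      delta = Math.max(0, delta - maintenanceCount);
      int neededHealthy =
          Math.max(0, minHealthyForMaintenance - healthyCount);
      delta = Math.max(neededHealthy, delta);
    }
    return delta - inFlightAdd + inFlightDel;
  }

  public static void main(String[] args) {
    // Arguments: repFactor, minHealthy, healthy, maintenance, adds, deletes.
    System.out.println(additionalReplicaNeeded(3, 2, 2, 3, 0, 0)); // H,H,M,M,M -> 0
    System.out.println(additionalReplicaNeeded(3, 2, 4, 1, 0, 0)); // H,H,H,H,M -> -1
    System.out.println(additionalReplicaNeeded(3, 2, 1, 2, 0, 0)); // H,M,M -> 1
    System.out.println(additionalReplicaNeeded(3, 2, 2, 0, 1, 0)); // H,H + add -> 0
    System.out.println(additionalReplicaNeeded(3, 2, 3, 0, 1, 0)); // H,H,H + add -> 0
    System.out.println(additionalReplicaNeeded(3, 2, 4, 0, 0, 1)); // H,H,H,H + delete -> 0
  }
}
```

The H,M,M case is where this differs from the design doc version: one extra replica is requested so that minHealthyForMaintenance healthy copies exist while the other nodes are offline.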

The following logic describes the conditions the replicas of a container 
must meet for it to be considered sufficiently replicated. Note that inflight 
adds are ignored, while inflight deletes are counted until they complete:

{code}
  /**
   * Return true if the container is sufficiently replicated. Decommissioning
   * and Decommissioned containers are ignored in this check, assuming they will
   * eventually be removed from the cluster.
   * This check ignores inflight additions, as those replicas have not yet been
   * created and the create could fail for some reason.
   * The check does consider inflight deletes as there may be 3 healthy replicas
   * now, but once the delete completes it will reduce to 2.
   * We also assume a replica in Maintenance state cannot be removed, so the
   * pending delete would affect only the healthy replica count.
   *
   * @return True if the container is sufficiently replicated and False
   *         otherwise.
   */
  public boolean isSufficientlyReplicated() {
    return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
        && healthyCount - inFlightDel >= minHealthyForMaintenance;
  }
{code}
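
The same kind of sanity check can be applied to this predicate. The sketch below is only an illustration under the same assumptions as before (replication factor 3, minHealthyForMaintenance 2); the class name SufficientReplicationSketch and the parameter list are hypothetical.

```java
// Hypothetical standalone sketch of the sufficiency check, with the counts
// passed as parameters instead of being read from fields.
class SufficientReplicationSketch {

  static boolean isSufficientlyReplicated(int repFactor,
      int minHealthyForMaintenance, int healthyCount, int maintenanceCount,
      int inFlightDel) {
    // Pending deletes are assumed to remove healthy replicas only, since a
    // replica on a node in maintenance cannot be removed.
    return (healthyCount + maintenanceCount - inFlightDel) >= repFactor
        && healthyCount - inFlightDel >= minHealthyForMaintenance;
  }

  public static void main(String[] args) {
    // Arguments: repFactor, minHealthy, healthy, maintenance, deletes.
    System.out.println(isSufficientlyReplicated(3, 2, 3, 0, 0)); // H,H,H -> true
    System.out.println(isSufficientlyReplicated(3, 2, 2, 1, 0)); // H,H,M -> true
    System.out.println(isSufficientlyReplicated(3, 2, 1, 2, 0)); // H,M,M -> false
    System.out.println(isSufficientlyReplicated(3, 2, 3, 0, 1)); // H,H,H + delete -> false
    System.out.println(isSufficientlyReplicated(3, 2, 4, 0, 1)); // H,H,H,H + delete -> true
  }
}
```

H,M,M fails the second condition even though three copies are tracked, which matches the extra replica requested by additionalReplicaNeeded() in the same situation.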

> Refactor ReplicationManager to consider maintenance states
> ----------------------------------------------------------
>
>                 Key: HDDS-2459
>                 URL: https://issues.apache.org/jira/browse/HDDS-2459
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> In its current form the replication manager does not consider decommission or 
> maintenance states when checking if replicas are sufficiently replicated. 
> With the introduction of maintenance states, it needs to consider 
> decommission and maintenance states when deciding if blocks are over or under 
> replicated.
> It also needs to provide an API to allow the decommission manager to check if 
> blocks are over or under replicated, so the decommission manager can decide 
> if a node has completed decommission and maintenance or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
