fapifta commented on a change in pull request #881:
URL: https://github.com/apache/hadoop-ozone/pull/881#discussion_r420070985
##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/SCMCommonPlacementPolicy.java
##########
@@ -198,4 +211,69 @@ public boolean hasEnoughSpace(DatanodeDetails datanodeDetails,
*/
public abstract DatanodeDetails chooseNode(
List<DatanodeDetails> healthyNodes);
+
+ /**
+ * Default implementation for basic placement policies that do not have a
+ * network topology. If the policy has no network topology this method should
+ * return null.
+ * @return The networkTopology for the policy or null if none is configured.
+ */
+ protected NetworkTopology getNetworkTopology() {
+ return null;
+ }
+
+ /**
+ * Default implementation to return the number of racks containers should span
+ * to meet the placement policy. For simple policies that are not rack aware
+ * we return 1 from this default implementation.
+ * @return The number of racks containers should span to meet the policy
+ */
+ protected int getRequiredRackCount() {
+ return 1;
+ }
+
+ /**
+ * This default implementation handles rack aware policies and non rack
+ * aware policies. If a future placement policy needs to check more than racks
+ * to validate the policy (eg node groups, HDFS like upgrade domain) this
+ * method should be overridden in the sub class.
+ * This method requires that subclasses which implement rack aware policies
+ * override the default methods getRequiredRackCount and getNetworkTopology.
+ * @param dns List of datanodes holding a replica of the container
+ * @param replicas The expected number of replicas
+ * @return ContainerPlacementStatus indicating if the placement policy is
+ * met or not. Note this only considers the rack count and not the
+ * number of replicas.
+ */
+ @Override
+ public ContainerPlacementStatus validateContainerPlacement(
+     List<DatanodeDetails> dns, int replicas) {
+   NetworkTopology topology = getNetworkTopology();
+   int requiredRacks = getRequiredRackCount();
+   if (topology == null || replicas == 1 || requiredRacks == 1) {
+     if (dns.size() > 0) {
+       // placement is always satisfied if there is at least one DN.
+       return validPlacement;
+     } else {
+       return invalidPlacement;
+     }
+   }
Review comment:
This part of the method clearly deals with the non-topology-aware case and
returns, while the rest deals with the topology-aware case. Why don't we use
polymorphism here to separate the two? I understand that this means duplicating
the topology-aware logic in two places, but that is another problem we will
have to deal with at some point.
So I believe this if branch should be the default implementation in this class,
and the rest should live inside the rack aware and pipeline placement policies.
That would also make getRequiredRackCount() unnecessary: here we could simply
drop the expression, which always evaluates to true for non-topology-aware
policies, while the topology-aware policies can define and use the constant in
their own implementations.
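Roughly what I have in mind is sketched below. This is illustration only: it
assumes the validPlacement / invalidPlacement constants are reachable from the
subclasses, and REQUIRED_RACKS and countRacks() are placeholders for whatever
constant and rack-counting logic the topology-aware policies would actually use.

```java
// In SCMCommonPlacementPolicy: the whole default implementation.
@Override
public ContainerPlacementStatus validateContainerPlacement(
    List<DatanodeDetails> dns, int replicas) {
  // Non-topology-aware policies are satisfied by any single replica.
  return dns.isEmpty() ? invalidPlacement : validPlacement;
}

// In the rack aware (and pipeline) placement policy:
@Override
public ContainerPlacementStatus validateContainerPlacement(
    List<DatanodeDetails> dns, int replicas) {
  if (dns.isEmpty()) {
    return invalidPlacement;
  }
  // REQUIRED_RACKS would be a constant owned by this policy, so the base
  // class no longer needs getRequiredRackCount() or getNetworkTopology().
  int requiredRacks = Math.min(REQUIRED_RACKS, replicas);
  // countRacks() stands in for counting distinct racks against this
  // policy's own NetworkTopology instance.
  return countRacks(dns) >= requiredRacks ? validPlacement : invalidPlacement;
}
```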
##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -393,8 +396,11 @@ private boolean isContainerHealthy(final ContainerInfo container,
*/
private boolean isContainerUnderReplicated(final ContainerInfo container,
final Set<ContainerReplica> replicas) {
+ boolean misReplicated = !getPlacementStatus(
Review comment:
I understand that this is here to check whether we have fewer than 2 racks
with replicas and, if so, to consider the container under replicated.
I am unsure, though, whether we want to ensure that a container is replicated
to exactly 2 racks; it does not seem to be the case in the rack aware placement
policy. If we do, then mis-replication should also be checked somewhere else,
to handle the case where a container ends up replicated to 3 racks, for example
because a rack was removed from the config and its nodes were placed into other
racks. (Probably this could go into isContainerOverReplicated(), if we want to
check for this case.)
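Just to illustrate where such a check could live; this is only a sketch, and
countDistinctRacks() plus requiredRackCount() are hypothetical helpers, since
the current ContainerPlacementStatus does not report a "too many racks"
condition:

```java
// Hypothetical check inside isContainerOverReplicated() (or a sibling method)
// to flag containers spread over more racks than the policy asks for.
List<DatanodeDetails> dns = replicas.stream()
    .map(ContainerReplica::getDatanodeDetails)
    .collect(Collectors.toList());
// countDistinctRacks() and requiredRackCount() do not exist today; they are
// placeholders for whatever rack accounting we would add for this purpose.
if (countDistinctRacks(dns) > requiredRackCount(container)) {
  // Candidate for consolidating replicas back onto the expected racks.
}
```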
##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final ContainerInfo container,
if (source.size() > 0) {
final int replicationFactor = container
.getReplicationFactor().getNumber();
- final int delta = replicationFactor - getReplicaCount(id, replicas);
+ // Want to check if the container is mis-replicated after considering
+ // inflight add and delete.
+ // Create a new list from source (healthy replicas minus pending delete)
+ List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+ // Then add any pending additions
+ targetReplicas.addAll(replicationInFlight);
+
+ int delta = replicationFactor - getReplicaCount(id, replicas);
+ final ContainerPlacementStatus placementStatus =
+ containerPlacement.validateContainerPlacement(
+ targetReplicas, replicationFactor);
+ final int misRepDelta = placementStatus.additionalReplicaRequired();
Review comment:
This method in reality gives back the additional rack count required, not the
additional replicas required.
That is why we also need to check against the difference between the
replication factor and the replica count here, even though we pass the
replication factor to the validation method, which should let it return the
number of replicas really required.
Let's see the different cases:
- Non-topology-aware placement policies: if replica count == replication
factor we are good; otherwise the required additional replica count is
replication factor - replica count, and we need that many extra replicas later
from chooseDatanodes.
- A topology-aware placement policy with replication factor 1: if it is under
replicated, we have already lost the block.
- A topology-aware placement policy with replication factor 2: if it is under
replicated, either we have lost the block, or we have it on one rack and
chooseDatanodes will give us the required node on a different rack.
- Topology-aware placement policies with replication factor 3 (or more):
  - When we have 1 healthy replica, we need 2 additional replicas anyway;
which racks they land on is not really interesting.
  - When we have 2 healthy replicas, we need 1 additional replica, either on
one of the racks that already holds a replica, or on another rack if the two
replicas share a rack; the policy already handles this, as it uses the excluded
node list to anchor to the proper racks in chooseDatanodes().
  - When we have 3 healthy replicas but all three are on the same rack,
additionalReplicaRequired should still yield 1, and chooseDatanodes should
correctly select a new set of DataNodes consisting of the nodes that already
hold the replica plus one additional node on a different rack. That makes the
container over replicated, so one excess replica will be removed later, after
replication has finished.
So if additionalReplicaRequired worked as its name suggests, we would not need
to check the replication factor and replica count here; we could just rely on
the value the method gives back.
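Only as an illustration, assuming the status object is constructed with both
the replica deficit and the rack deficit in mind; the field names below are
made up, not the actual ones in ContainerPlacementStatus:

```java
// Hypothetical internals of a ContainerPlacementStatus implementation.
// expectedReplicas - the replication factor passed to validation
// actualReplicas   - number of replicas in the validated list
// expectedRacks    - racks the policy wants, capped at expectedReplicas
// actualRacks      - distinct racks actually holding a replica
public int additionalReplicaRequired() {
  int replicaDeficit = expectedReplicas - actualReplicas;
  int rackDeficit = expectedRacks - actualRacks;
  // The larger deficit is the number of replicas we still have to create.
  return Math.max(0, Math.max(replicaDeficit, rackDeficit));
}
```

With that, handleUnderReplicatedContainer() could use
placementStatus.additionalReplicaRequired() directly as replicasNeeded and drop
the comparison with delta.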
##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final ContainerInfo container,
if (source.size() > 0) {
final int replicationFactor = container
.getReplicationFactor().getNumber();
- final int delta = replicationFactor - getReplicaCount(id, replicas);
+ // Want to check if the container is mis-replicated after considering
+ // inflight add and delete.
+ // Create a new list from source (healthy replicas minus pending delete)
+ List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+ // Then add any pending additions
+ targetReplicas.addAll(replicationInFlight);
+
+ int delta = replicationFactor - getReplicaCount(id, replicas);
+ final ContainerPlacementStatus placementStatus =
+ containerPlacement.validateContainerPlacement(
+ targetReplicas, replicationFactor);
+ final int misRepDelta = placementStatus.additionalReplicaRequired();
+ final int replicasNeeded
+ = delta < misRepDelta ? misRepDelta : delta;
+
final List<DatanodeDetails> excludeList = replicas.stream()
.map(ContainerReplica::getDatanodeDetails)
.collect(Collectors.toList());
+ excludeList.addAll(replicationInFlight);
Review comment:
Do we add the in-flight replications twice here this way, or am I missing
something?
##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final ContainerInfo container,
if (source.size() > 0) {
final int replicationFactor = container
.getReplicationFactor().getNumber();
- final int delta = replicationFactor - getReplicaCount(id, replicas);
+ // Want to check if the container is mis-replicated after considering
+ // inflight add and delete.
+ // Create a new list from source (healthy replicas minus pending delete)
+ List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+ // Then add any pending additions
+ targetReplicas.addAll(replicationInFlight);
+
+ int delta = replicationFactor - getReplicaCount(id, replicas);
+ final ContainerPlacementStatus placementStatus =
+ containerPlacement.validateContainerPlacement(
+ targetReplicas, replicationFactor);
+ final int misRepDelta = placementStatus.additionalReplicaRequired();
+ final int replicasNeeded
+ = delta < misRepDelta ? misRepDelta : delta;
+
final List<DatanodeDetails> excludeList = replicas.stream()
.map(ContainerReplica::getDatanodeDetails)
.collect(Collectors.toList());
+ excludeList.addAll(replicationInFlight);
List<InflightAction> actionList = inflightReplication.get(id);
if (actionList != null) {
actionList.stream().map(r -> r.datanode)
.forEach(excludeList::add);
}
final List<DatanodeDetails> selectedDatanodes = containerPlacement
- .chooseDatanodes(excludeList, null, delta,
+ .chooseDatanodes(excludeList, null, replicasNeeded,
container.getUsedBytes());
-
- LOG.info("Container {} is under replicated. Expected replica count" +
- " is {}, but found {}.", id, replicationFactor,
- replicationFactor - delta);
-
- for (DatanodeDetails datanode : selectedDatanodes) {
- sendReplicateCommand(container, datanode, source);
+ if (delta > 0) {
+ LOG.info("Container {} is under replicated. Expected replica count" +
+ " is {}, but found {}.", id, replicationFactor,
+ replicationFactor - delta);
+ }
+ int newMisRepDelta = misRepDelta;
+ if (misRepDelta > 0) {
+ LOG.info("Container: {}. {}",
+ id, placementStatus.misReplicatedReason());
+ // Check if the new target nodes (original plus newly selected nodes)
+ // makes the placement policy valid.
+ targetReplicas.addAll(selectedDatanodes);
+ newMisRepDelta = containerPlacement.validateContainerPlacement(
+ targetReplicas, replicationFactor).additionalReplicaRequired();
Review comment:
The rack-aware policy's chooseDatanodes() method that we use ensures, based on
the excluded nodes list, that we get back a proper list, so I believe we do not
need to check it twice. Are you aware of any cases where this is not true?
Based on my understanding of the logic there, I would expect all future
placement policies to behave the same way, as the other existing ones
implicitly do so now. If necessary we can add this to the contract of the
interface; this is an interesting property of the rack-aware policy that would
be worth some re-thinking, but that is out of scope for now.
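If we rely on that contract, the re-validation after node selection could be
dropped and the block above could shrink to something like this (sketch only,
reusing the names already present in the patch):

```java
// Trust that chooseDatanodes() honoured the exclude list and the topology,
// so the selected nodes can be replicated to without re-validating placement.
if (misRepDelta > 0) {
  LOG.info("Container: {}. {}", id, placementStatus.misReplicatedReason());
}
for (DatanodeDetails datanode : selectedDatanodes) {
  sendReplicateCommand(container, datanode, source);
}
```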
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]