swamirishi commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307935897
> ```
> Rack 1: 1,2
> Rack 2: 3,4
> Rack 3: 1
> Rack 4: 1
> ```
>
> In this example the over-replication handler should remove two "1" replicas, one from Rack 1 and then one from either Rack 3 or Rack 4, leaving:
>
> ```
> Rack 1: 2
> Rack 2: 3,4
> Rack 3: 1
> Rack 4:
> ```
>
> However, you are correct in that making an extra copy of "1" doesn't do much good.
>
> I think we need to take a step back here. What does it mean to be mis-replicated?
>
> It means that the replicas are not spread across enough racks. If there are fewer racks on the cluster than the replicaNum, then it is also fine, for example, for there to be 2 replicas per rack.
>
> Assuming there are plenty of racks, for a container to be mis-replicated when all the replicas are present, there must be some racks hosting more than 1 replica. You can extend this further and say there must be some racks hosting more than 1 replica, where the replica is not also present on another rack.
>
> ```
> R1: 1, 2
> R2: 1, 2
> R3: 3
> R4: 4
> R5: 5
> ```
>
> Above is not mis-replicated, it is simply over-replicated.
>
> A significant complication in the solution to this problem is that a container can be both over-replicated and mis-replicated at the same time. If we remove the over-replication part, then the problem becomes simpler, as we can then move any random replica from a rack with more than 1 index.
>
> One idea that is worth thinking about: what if we changed the ECReplicationCheckHandler to return health states in this order:
>
> 1. underReplicated
> 2. overReplicated
> 3. misReplicated
>
> If a container is both over- and mis-replicated, rather than its state being mis-replicated (actually under-replicated due to mis-replication), it will return as over-replicated. Once the over-replication gets fixed, it will be re-processed and come out as mis-replicated.
>
> Of course, fixing the mis-replication will cause it to go over-replicated again, but I feel this over + mis-replicated combination will be a relatively rare occurrence in practice.
>
> Alternatively, I wonder if an algorithm like this would work even with over-replication:
>
> ```
> for each rack_group
>   if replicas_on_rack > 1
>     for each replica_on_rack
>       if (another_copy_of_replica_exists)
>         continue // do nothing as we don't need to copy it
>       else
>         copy_to_new_rack
>       end if
>     end
>   end
> end
> ```
>
> I think this would handle these scenarios:
>
> ```
> R1: 1, 2
> R2: 2, 3
> R3: 1
> R4: 4
> R5: 5
>
> R1: 1, 2, 4
> R2: 2, 3
> R3: 1
> R4:
> R5: 5
>
> R1: 1, 1, 2
> R2: 2, 3
> R3:
> R4: 4
> R5: 5
> ```
>
> If the above works, then we just need 2 maps:
>
> replica_index -> count_of_occurrences
> rack -> List<ContainerReplica>
>
> EDIT:
>
> ```
> R1: 1, 1, 2, 2
> R2:
> R3: 3
> R4: 4
> R5: 5
> ```
>
> This scenario would not work with the above algorithm, as both 1 and 2 have 2 occurrences, so we would skip them both. So we would have to check has_another_occurrence_on_a_different_rack instead, but then we also have to consider the distinct list of indexes on each rack. So if we had:
>
> replica_index -> rack_count
> rack -> Set<ReplicaIndex>
>
> ```
> for each rack_group
>   if replicas_on_rack > 1
>     max_to_move = replicas_on_rack - 1
>     moved_from_rack = 0
>     for each replica_on_rack
>       if (replica_on_multiple_racks)
>         continue // do nothing as we don't need to copy it
>       else
>         copy_to_new_rack
>         increment replica_index_to_rack_count_map
>         moved_from_rack += 1
>       end if
>       if (moved_from_rack >= max_to_move)
>         break;
>       end
>     end
>   end
> end
> ```
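For reference, here is the quoted loop rendered as a minimal Java sketch; the String rack names, the two maps and copyToNewRack are placeholders for illustration, not actual Ozone code:

```
import java.util.*;

// Minimal sketch of the quoted pseudocode.
class QuotedLoopSketch {
  // rackToIndexes: rack -> Set<ReplicaIndex>;
  // indexToRackCount: replica_index -> rack_count.
  static void fixMisReplication(Map<String, Set<Integer>> rackToIndexes,
                                Map<Integer, Integer> indexToRackCount) {
    for (Set<Integer> indexesOnRack : rackToIndexes.values()) {
      if (indexesOnRack.size() <= 1) {
        continue; // at most one index on this rack, nothing to do
      }
      int maxToMove = indexesOnRack.size() - 1;
      int movedFromRack = 0;
      for (int index : indexesOnRack) {
        if (indexToRackCount.getOrDefault(index, 0) > 1) {
          continue; // index already lives on multiple racks, no copy needed
        }
        copyToNewRack(index); // placeholder for the actual replicate command
        indexToRackCount.merge(index, 1, Integer::sum);
        movedFromRack++;
        if (movedFromRack >= maxToMove) {
          break; // enough replicas scheduled to move off this rack
        }
      }
    }
  }

  static void copyToNewRack(int index) { /* issue replicate command */ }
}
```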
The current algorithm I have implemented takes all of the edge cases discussed above into consideration.
**ECPlacementManager.getPlacementGroupReplicaIndexMap:**
The algorithm starts by grouping the replicas by placement group (Map<PlacementGroup, Set<Integer>>), as sketched below.
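A minimal sketch of this grouping step (the Replica accessor interface and the String placement-group keys, e.g. rack names, are hypothetical stand-ins for the real Ozone types):

```
import java.util.*;

// Hypothetical stand-in accessors; String rack names stand in for the
// PlacementGroup type used above.
interface Replica {
  String getPlacementGroup(); // e.g. the rack of the hosting datanode
  int getReplicaIndex();      // EC replica index
}

class PlacementGroupingSketch {
  // Group replicas by placement group: placement group -> set of replica indexes.
  static Map<String, Set<Integer>> getPlacementGroupReplicaIndexMap(
      Collection<Replica> replicas) {
    Map<String, Set<Integer>> groupToIndexes = new HashMap<>();
    for (Replica replica : replicas) {
      groupToIndexes
          .computeIfAbsent(replica.getPlacementGroup(), g -> new HashSet<>())
          .add(replica.getReplicaIndex());
    }
    return groupToIndexes;
  }
}
```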
**ECPlacementManager.getMisreplicatedIndexes(Map<PlacementGroup, Set<Integer>>)**
This function starts by building the reverse map, i.e. each replica index's spread across the different placement groups (Map<Integer, Set<PlacementGroup>>).
We then create two sorted sets, a ConcurrentSkipListSet<PlacementGroup> and a ConcurrentSkipListSet<Integer>, for placement groups and replica indexes respectively, each ordered by the size of the corresponding value set.
We pop out any placement group holding <= 1 replica index; this is why we use a ConcurrentSkipListSet<PlacementGroup>.
The replica index coming out of it is not misreplicated, so we remove that index from consideration, i.e. remove it from all the placement groups in the map. We repeat this process until there is no placement group containing <= 1 replica index.
Then we pop out the replica index with the least number of replicas, i.e. the one spread across the fewest placement groups, consider that replica to be misreplicated, remove it from consideration, and remove it from all the placement groups.
This loop runs until there are no replica indexes left to process.
The function returns the indexes to be copied to other racks to fix the misreplication. The existing OverreplicationHandler can fix the overreplication after this.
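A simplified sketch of this function, again with String racks standing in for PlacementGroup; instead of maintaining the two ConcurrentSkipListSets ordered by set size, it rescans for the current minimum on every pass, but the peeling logic is the same:

```
import java.util.*;

class MisreplicationSketch {
  // Returns the replica indexes to copy to new racks to fix misreplication.
  static List<Integer> getMisreplicatedIndexes(
      Map<String, Set<Integer>> groupToIndexes) {
    // Reverse map: replica index -> placement groups hosting it.
    Map<Integer, Set<String>> indexToGroups = new HashMap<>();
    for (Map.Entry<String, Set<Integer>> e : groupToIndexes.entrySet()) {
      for (int index : e.getValue()) {
        indexToGroups.computeIfAbsent(index, i -> new HashSet<>())
            .add(e.getKey());
      }
    }
    List<Integer> misreplicated = new ArrayList<>();
    while (!indexToGroups.isEmpty()) {
      // Pick the placement group currently holding the fewest replica indexes.
      String group = Collections.min(groupToIndexes.keySet(),
          Comparator.comparingInt((String g) -> groupToIndexes.get(g).size()));
      if (groupToIndexes.get(group).size() <= 1) {
        // Its index (if any) is correctly placed: drop the group, and drop
        // the index from all other groups and from consideration.
        for (int index : new ArrayList<>(groupToIndexes.remove(group))) {
          removeIndex(index, groupToIndexes, indexToGroups);
        }
      } else {
        // Every remaining group hosts more than one index: take the index
        // spread across the fewest groups as misreplicated.
        int index = Collections.min(indexToGroups.keySet(),
            Comparator.comparingInt((Integer i) -> indexToGroups.get(i).size()));
        misreplicated.add(index);
        removeIndex(index, groupToIndexes, indexToGroups);
      }
    }
    return misreplicated;
  }

  private static void removeIndex(int index,
      Map<String, Set<Integer>> groupToIndexes,
      Map<Integer, Set<String>> indexToGroups) {
    indexToGroups.remove(index);
    groupToIndexes.values().forEach(s -> s.remove(index));
  }
}
```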
Let us consider the following case and do a dry run:
PlacementGroupReplicaMap
Rack 1: 1,2
Rack 2: 3,4
Rack 3: 5,3
Rack 4: 1
Here misreplicationCnt = 1; indexes 1 and 3 are over-replicated with 2 replicas each.
Reverse Replica Map:
1: Rack 1, Rack 4
2: Rack 1
3: Rack 2, Rack 3
4: Rack 2
5: Rack 3
MisreplicatedIndexesList = []
ConcurrentSkipListSet<PlacementGroup> = Rack4, Rack1, Rack2, Rack3
ConcurrentSkipListSet<Index> = 2,4,5,1,3
**Iteration 1:**
Rack 4 Size = 1, Contains Replica Index = 1
Remove Rack 4 & Replica Index 1 from all Placement Groups
then:
PlacementGroupReplicaMap
Rack 1: 2
Rack 2: 3,4
Rack 3: 5,3
Rack 4:
Reverse Replica Map:
2: Rack 1
3: Rack 2, Rack 3
4: Rack 2
5: Rack 3
ConcurrentSkipListSet<PlacementGroup> = Rack1, Rack2, Rack3
ConcurrentSkipListSet<Index> = 2,4,5,3
MisreplicatedIndexesList = []
**Iteration 2:**
Rack 1 Size = 1, Contains Replica Index = 2
Remove Rack 1 & Replica Index 2 from all Placement Groups
then:
PlacementGroupReplicaMap
Rack 1:
Rack 2: 3,4
Rack 3: 5,3
Rack 4:
Reverse Replica Map:
1:
2:
3: Rack 2, Rack 3
4: Rack 2
5: Rack 3
ConcurrentSkipListSet<PlacementGroup> = Rack2, Rack3
ConcurrentSkipListSet<Index> = 4,5,3
MisreplicatedIndexesList = []
**Iteration 3:**
Rack 2 Size = 2, Contains Replica Indexes = 3,4
As size > 1, pop out from the ReplicaIndexSkipList, which yields replica index 4; add it to MisreplicatedIndexesList & remove replica index 4 from all placement groups
then:
PlacementGroupReplicaMap
Rack 1:
Rack 2: 3
Rack 3: 5,3
Rack 4:
Reverse Replica Map:
1:
2:
3: Rack 2, Rack 3
4:
5: Rack 3
ConcurrentSkipListSet<PlacementGroup> = Rack2, Rack3
ConcurrentSkipListSet<Index> = 5,3
MisreplicatedIndexesList = [4]
**Iteration 4:**
Rack 2 Size = 1, Contains Replica Index = 3
Remove Rack 2 & Replica Index 3 from all Placement Groups
then:
**PlacementGroupReplicaMap**
Rack 1:
Rack 2:
Rack 3: 5
Rack 4:
**Reverse Replica Map:**
1:
2:
3:
4:
5: Rack 3
**ConcurrentSkipListSet<PlacementGroup>** = Rack3
**ConcurrentSkipListSet<Index>** = 5
**MisreplicatedIndexesList** = [4]
**Iteration 5:**
Rack 3 Size = 1, Contains Replica Index = 5
Remove Rack 3 & Replica Index 5 from all Placement Groups
then:
**PlacementGroupReplicaMap**
Rack 1:
Rack 2:
Rack 3:
Rack 4:
**Reverse Replica Map:**
1:
2:
3:
4:
5:
**ConcurrentSkipListSet<PlacementGroup>** =
**ConcurrentSkipListSet<Index>** =
**MisreplicatedIndexesList** = [4]
The loop ends since the ConcurrentSkipListSet<Index> is empty.
The function returns MisreplicatedIndexesList.
The first misreplicationCount elements from this list are taken to be copied to different racks.
Thus here, Replica Index 4 would be copied to Rack 5, making:
Rack 1: 1,2
Rack 2: 3,4
Rack 3: 5,3
Rack 4: 1
Rack 5: 4
The Overreplication Handler can then delete Replica 1 from Rack 1, Replica 3 from Rack 3 & Replica 4 from Rack 2, making the spread:
Rack 1: 2
Rack 2: 3
Rack 3: 5
Rack 4: 1
Rack 5: 4
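For completeness, a quick harness over the sketch above reproduces this dry run (String racks as before):

```
import java.util.*;

// Quick harness over MisreplicationSketch, reproducing the dry run above.
class DryRun {
  public static void main(String[] args) {
    Map<String, Set<Integer>> groups = new HashMap<>();
    groups.put("Rack 1", new HashSet<>(Arrays.asList(1, 2)));
    groups.put("Rack 2", new HashSet<>(Arrays.asList(3, 4)));
    groups.put("Rack 3", new HashSet<>(Arrays.asList(5, 3)));
    groups.put("Rack 4", new HashSet<>(Collections.singletonList(1)));
    // Prints [4] (or [5]: indexes 4 and 5 tie on rack count in iteration 3,
    // and copying either one to a new rack is an equally valid fix).
    System.out.println(MisreplicationSketch.getMisreplicatedIndexes(groups));
  }
}
```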