swamirishi commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307935897

   > ```
   > Rack 1: 1,2
   > Rack 2: 3,4
   > Rack 3: 1
   > Rack 4: 1
   > ```
   > 
   > In this example the over rep handler should remove 2 "1" replicas, from 
rack1 and then either 3 or 4, leaving:
   > 
   > ```
   > Rack 1: 2
   > Rack 2: 3,4
   > Rack 3: 1
   > Rack 4:
   > ```
   > 
   > However you are correct, in that making an extra copy of "1" doesn't do 
much good.
   > 
   > I think we need to take a step back here. What does it mean to be 
mis-replicated?
   > 
   > It means that the replicas are not spread across enough racks. If there 
are less racks on the cluster than the replicaNum, then it is also fine for 
there to be 2 replicas per rack for example.
   > 
   > Assuming there are plenty of racks, for a container to be mis-replicated 
when all the replicas are present, there must be some racks hosting more than 1 
replica. You can further extend this, and say, there must be some racks hosting 
more than 1 replica, where the replica is not also on another rack.
   > 
   > ```
   > R1: 1, 2
   > R2: 1, 2
   > R3: 3
   > R4: 4
   > R5: 5
   > ```
   > 
   > Above is not mis-replicated, it is simply over-replicated.
   > 
   > A significant complication in the solution to this problem is that a 
container can be both over-replicated and mis-replicated at the same time. If 
we remove the over-replication part, then the problem becomes simpler, as can 
then move any random replica from a rack with more than 1 index.
   > 
   > One idea that is worth thinking about, what if we changed the 
ECReplicationCheckHandler, to return health states in this order:
   > 
   > underReplicated overReplicated misReplicated
   > 
   > If a container is both over and mis replicated, rather than its state 
being mis-replicated (actually under-replicated due to mis-replication), it 
will return as over-replicated. Once the over-replication gets fixed, it will 
be re-processed and come out as mis-replicated.
   > 
   > Of course, fixing the mis-replication will cause it to go over replicated 
again, but I feel this over + mis-replicated will be a relatively rare 
occurrence in practice.
   > 
   > Alternatively, I wonder if the algorithm like this will work even with 
over-replication:
   > 
   > ```
   > for each rack_group
   >   if replicas_on_rack > 1
   >     for each replica_on_rack
   >       if (another_copy_of_replica_exists)
   >         continue // do nothing as we don't need to copy it
   >       else
   >         copy_to_new_rack
   >       end if
   >     end
   > end
   > ```
   > 
   > I think this would handle these scenarios:
   > 
   > ```
   > R1: 1, 2
   > R2: 2, 3
   > R3: 1
   > R4: 4
   > R5: 5
   > 
   > R1: 1, 2, 4
   > R2: 2, 3
   > R3: 1
   > R4:
   > R5: 5
   > 
   > R1: 1, 1, 2
   > R2: 2, 3
   > R3: 
   > R4: 4
   > R5: 5
   > ```
   > 
   > If the above works, then we just need 2 maps:
   > 
   > replica_index -> count_of_occurrences rack -> List<ContainerReplica>
   > 
   > EDIT:
   > 
   > ```
   > R1: 1, 1, 2, 2
   > R2: 
   > R3: 3
   > R4: 4
   > R5: 5
   > ```
   > 
   > This scenario would not work with the above algorithm, as both 1 and 2 
have 2 occurrences, so we would skip them both. So we would have to say 
has_another_occurrence_on_a_different_rack, but then we also have to consider 
the distinct list on each rack. So if we had:
   > 
   > replica_index -> rack_count rack -> Set<ReplicaIndex>
   > 
   > ```
   > for each rack_group
   >   if replicas_on_rack > 1
   >     max_to_move = replicas_on_rack - 1
   >     moved_from_rack = 0
   >     for each replica_on_rack
   >       if (replica_on_multiple_racks)
   >         continue // do nothing as we don't need to copy it
   >       else
   >         copy_to_new_rack
   >         increment replica_index_to_rack_count_map
   >         moved_from_rack += 1
   >       end if
   >       if (moved_from_rack >= max_to_move)
   >         break;
   >     end
   >   end
   > end
   > ```
   
   The current Algorithm I have implemented takes into consideration of all the 
following edge cases.
   **ECPlacementManager.getPlacementGroupReplicaIndexMap:**
   The algorithm groups starts with the replicas based on the placement 
group(Map<PlacementGroup, Set<Integer>>) 
   **ECPlacementManager.getMisreplicatedIndexes(Map<PlacementGroup, 
Set<Integer>>)**
   This function starts with getting the reverseMap i.e. replicationIndex's 
spread across the different placement group (Map<Integer, Set<PlacementGroup>>).
   
   Now we create 2  ConcurrentSkipListSet<PlacementGroup> &  
ConcurrentSkipListSet<Integer> for replicas & placement group respectively 
sorted based on the size of set in value.
   
   We pop out any placement group having  <=1 replicationIndex, this is why we 
have ConcurrentSkipListSet<PlacementGroup>
   The replica Index coming out of it is not misreplicated, thus we remove this 
index from our consideration thus remove the replica index from all the 
placementGroups Map. We continue to do the same process till there is no 
placement group containing <= 1 replica indexes.
   
   Now we pop out the replica Index with least number of replicas & we consider 
that replica to be misreplicated & we remove this replication index from our 
consideration & remove this replica index from all the placement Group.
   This loop runs till there are no more misreplicated indexes.
   This function would return the indexes to be copied to fix misreplication.
   
   Existing OverreplicationHandler can fix the overreplication post this.
   
   Let us consider the case & do a dry run:
   PlacementGroupReplicaMap
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 1
   
   Here misreplicationCnt = 1, Overreplicated Idx = 1 with 2 replicas
   Reverse Replica Map:
   1 : Rack 1, Rack 4
   2: Rack 1
   3: Rack 2, Rack 3
   4: Rack 2
   5: Rack 3
   
   MisreplicatedIndexesList = []
   ConcurrentListSet<PlacmentGroup> = Rack4, Rack1, Rack2, Rack3
   ConcurrentListSet<Index> = 2,4,5,1,3
   
   **Iteration 1:**
   Rack 4 Size  = 1 , Contains Replica Index = 1
   Remove Rack 4 & Replica Index 1 from all Placment Group
   then:
   PlacementGroupReplicaMap
   Rack 1: 2
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 
   Reverse Replica Map:
   2: Rack 1
   3: Rack 2, Rack 3
   4: Rack 2
   5: Rack 3
   ConcurrentListSet<PlacmentGroup> = Rack1, Rack2, Rack3
   ConcurrentListSet<Index> = 2,4,5,3
   MisreplicatedIndexesList = []
   
   **Iteration 2:**
   Rack 1 Size  = 1 , Contains Replica Index = 2
   Remove Rack 1 & Replica Index 2 from all Placment Group
   then:
   PlacementGroupReplicaMap
   Rack 1:
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 
   Reverse Replica Map: 
   1:
   2:
   3: Rack 2, Rack 3
   4: Rack 2
   5: Rack 3
   ConcurrentListSet<PlacmentGroup> =  Rack2, Rack3
   ConcurrentListSet<Index> = 4,5,3
   MisreplicatedIndexesList = []
   
   
   **Iteration 3:**
   Rack 2 Size  = 2 , Contains Replica Index = 3,4
   As size > 1 pop out from ReplicaIndexSkipList which is replica Index 4 & add 
it to MisreplicatedIndexesList & remove replica index 4 from all placement 
groups
   then:
   PlacementGroupReplicaMap
   Rack 1:
   Rack 2: 3
   Rack 3: 5,3
   Rack 4: 
   Reverse Replica Map: 
   1:
   2:
   3: Rack 2, Rack 3
   4: 
   5: Rack 3
   ConcurrentListSet<PlacmentGroup> =  Rack2, Rack3
   ConcurrentListSet<Index> = 5,3
   MisreplicatedIndexesList = [4]
   
   **Iteration 4:**
   Rack 2 Size  = 1 , Contains Replica Index = 3
   Remove Rack 2 & Replica Index 3 from all Placment Group
   then:
   **PlacementGroupReplicaMap**
   Rack 1:
   Rack 2: 
   Rack 3: 5
   Rack 4: 
   **Reverse Replica Map:** 
   1:
   2:
   3: 
   4: 
   5: Rack 3
   **ConcurrentListSet<PlacmentGroup>** =  Rack3
   **ConcurrentListSet<Index>** = 5
   **MisreplicatedIndexesList** = [4]
   
   **Iteration 5:**
   Rack 3 Size  = 1 , Contains Replica Index = 5
   Remove Rack 3 & Replica Index 5 from all Placment Group
   then:
   **PlacementGroupReplicaMap**
   Rack 1:
   Rack 2: 
   Rack 3: 
   Rack 4: 
   **Reverse Replica Map:** 
   1:
   2:
   3: 
   4: 
   5:
   **ConcurrentListSet<PlacmentGroup>** =  
   **ConcurrentListSet<Index>** = 
   **MisreplicatedIndexesList** = [4]
   
   Loop ends since ConcurrentListSet<Index> is empty
   Function returns MisreplicatedIndexesList.
   The starting misreplicationCount elements from this list is taken to be 
copied to different racks.
   Thus here:
   Replica Index 4 would be copied to Rack 5
   Thus making:
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 1
   Rack 5: 4
   
   Over replication Handler can delete Replica 1 from Rack 1, Replica 3 from 
Rack 3  & Replica 4 from Rack 2 making the spread:
   Rack 1: 2
   Rack 2: 3
   Rack 3: 5
   Rack 4: 1
   Rack 5: 4
    
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to