Re: [PR] [WIP][CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files [celeborn]

via GitHub Sun, 14 Apr 2024 00:52:57 -0700


wangshengjie123 commented on code in PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#discussion_r1564540946



##########
client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:
##########
@@ -135,30 +136,37 @@ protected Compressor initialValue() {
 
   private final ReviveManager reviveManager;
 
+  private final boolean dataPushFailureTrackingEnabled;
+
   protected static class ReduceFileGroups {
     public Map<Integer, Set<PartitionLocation>> partitionGroups;
+    public Map<String, Set<PushFailedBatch>> pushFailedBatches;

Review Comment:
   I'm not sure I fully understand your point. Do you mean:
   - `the data would be under replicated`: does this mean data will be lost? I 
think it will not be lost, because a failed batch is retried against another 
PartitionLocation requested via Revive, with epoch + 1.
   - `If yes, wouldn't it not be simply better to retry to the same (or perhaps 
updated) peer ?`: Revive is designed to avoid retrying the push to the same 
PartitionLocation, because a retry against the same node may fail or time out 
again. We also want to ensure that every successful batch is available on both 
the primary and the replica peer.
   
   If I misunderstood, could you kindly explain further and provide an example?
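
   To make the retry flow I am describing concrete, here is a minimal sketch. 
It is only an illustration under assumptions: `ReviveSketch`, `Location`, and 
`revive` are hypothetical names, not Celeborn's actual classes or API, and the 
sketch assumes the revived location avoids both hosts of the failed one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the retry flow; not Celeborn's real code.
final class ReviveSketch {

    /** Simplified stand-in for a PartitionLocation (assumed fields only). */
    static final class Location {
        final int partitionId;
        final int epoch;
        final String primary;
        final String replica;

        Location(int partitionId, int epoch, String primary, String replica) {
            this.partitionId = partitionId;
            this.epoch = epoch;
            this.primary = primary;
            this.replica = replica;
        }
    }

    /**
     * Hypothetical revive step: keep the partition id, bump the epoch by one,
     * and pick a primary/replica pair that excludes the hosts of the failed
     * location, so the retry does not hit the same node again.
     */
    static Location revive(Location failed, List<String> workers) {
        List<String> candidates = new ArrayList<>(workers);
        candidates.remove(failed.primary);
        candidates.remove(failed.replica);
        if (candidates.size() < 2) {
            throw new IllegalStateException(
                "not enough healthy workers for a primary and a replica");
        }
        return new Location(
            failed.partitionId, failed.epoch + 1,
            candidates.get(0), candidates.get(1));
    }
}
```

   In this model, a batch that failed on epoch 0 is pushed again to the epoch 1 
location, so a batch that eventually succeeds has both a primary and a replica 
copy there.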
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
