Re: [PR] [WIP][CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files [celeborn]

via GitHub Thu, 04 Apr 2024 06:04:20 -0700


s0nskar commented on PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2037156713


   From my understanding, this PR diverges from the vanilla Spark approach 
based on mapIndex and instead divides the full partition into multiple 
sub-partitions based on some heuristics. I'm new to the Celeborn code, so I 
might be missing something basic, but I don't think this PR addresses the 
issue below. Consider a basic scenario where a partial partition read is 
happening and we see a `FetchFailure`.
   
   `ShuffleMapStage --> ResultStage`
                        
   - ShuffleMapStage (attempt 0) generated [P0, P1, P2] and P0 is skewed, with 
partition locations [0, 1, 2, 3, 4, 5].
   - AQE asks for three splits and this PR's logic will create three 
sub-partitions: [0, 1], [2, 3], [4, 5]
   - Now consider a reducer that reads [0, 1] and [2, 3] and gets a 
`FetchFailure` while reading [4, 5]
   - This will trigger a complete mapper stage retry according to this 
[doc](https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit)
 and will clear the map output corresponding to the shuffleID
   - ShuffleMapStage (attempt 1) will again generate data for P0 at different 
partition locations [a, b, c, d, e, f], which will get divided into [a, b], 
[c, d], [e, f]
   - Now if the reader stage is a `ShuffleMapStage` then it will read every 
sub-partition again, but if the reader is a `ResultStage` then it will only 
read the missing partition data, which is [e, f].
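
   To make the splitting step in the scenario concrete, here is a minimal 
sketch. It assumes the heuristic is a plain even, contiguous split over the 
ordered list of partition locations; the helper name `split_locations` is 
hypothetical and is not taken from the PR or from Celeborn.

```python
def split_locations(locations, num_splits):
    """Divide an ordered list of partition locations into contiguous groups.

    Hypothetical stand-in for the PR's heuristic: an even split, with any
    remainder spread over the first groups.
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    size, extra = divmod(len(locations), num_splits)
    groups = []
    start = 0
    for i in range(num_splits):
        end = start + size + (1 if i < extra else 0)
        groups.append(locations[start:end])
        start = end
    return groups


# Attempt 0: P0 is stored at locations 0..5 and AQE asks for three splits.
attempt0_splits = split_locations([0, 1, 2, 3, 4, 5], 3)

# After the stage retry, P0 is stored at new locations a..f.
attempt1_splits = split_locations(["a", "b", "c", "d", "e", "f"], 3)
```

   The split itself is deterministic in both attempts. The concern is only 
about what data ends up inside each location.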
   
   The data written to location `1` and location `a` would be different 
because of factors like network delay (the same applies to the other 
locations). For example, data present in the 1st location in the first 
attempt might be present in the 2nd location, or any other location, in a 
different attempt, because it depends on the order in which the mappers 
generated the data and the order in which the server received it.
   
   This can cause both data loss and data duplication. It might be addressed 
somewhere else in the codebase that I'm not aware of, but I wanted to point 
this problem out.


