saurabhd336 commented on PR #3279:
URL: https://github.com/apache/celeborn/pull/3279#issuecomment-2908151789
Hi @mridulm
Sure, let me pursue this via a CIP!
a) Agreed, random seeks can be slower than sequential reads; I wanted to
propose this as an alternative for cases where the sorting overhead may be
greater. One way to tackle this is prefetching: since the chunk ids to be
fetched are known beforehand, a few chunks can be fetched in advance and cached
in memory in anticipation of the next read request(s). Of course, this is
equally applicable to all partition writer types (sorted / non-sorted /
inverted-index based).
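To make the prefetching idea concrete, here is a minimal sketch (names like `ChunkPrefetcher` and `fetchChunk` are illustrative, not from the PR): while the caller consumes the current chunk, reads for the next few known chunk ids are kicked off in the background and cached as futures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical sketch: the full list of chunk ids is known up front, so we
// schedule reads a few chunks ahead of the current position.
class ChunkPrefetcher {
    private final IntFunction<byte[]> fetchChunk;   // the actual (possibly random-seek) disk read
    private final int lookahead;                    // how many chunks to read ahead
    private final Map<Integer, Future<byte[]>> inflight = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    ChunkPrefetcher(IntFunction<byte[]> fetchChunk, int lookahead) {
        this.fetchChunk = fetchChunk;
        this.lookahead = lookahead;
    }

    // Return chunkIds[idx], scheduling the next `lookahead` ids in the background.
    byte[] read(int[] chunkIds, int idx) throws Exception {
        for (int i = idx + 1; i <= idx + lookahead && i < chunkIds.length; i++) {
            int id = chunkIds[i];
            inflight.computeIfAbsent(id, c -> pool.submit(() -> fetchChunk.apply(c)));
        }
        Future<byte[]> f = inflight.remove(chunkIds[idx]);
        return f != null ? f.get() : fetchChunk.apply(chunkIds[idx]);
    }

    void close() { pool.shutdownNow(); }
}
```

The same wrapper would work regardless of the partition writer type, since it only depends on the chunk id sequence being known in advance.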
b) Again agreed; I had added some rough estimates in the doc and will work on
adding more concrete numbers. The overheads are:
1) Header data buffer (8 + 8 + 4 bytes)
2) Chunk offsets buffer (8 bytes * number of chunks). To keep this one in
check, I had thought about aggregating data for a given mapper id in memory
chunks of, say, 2 MB each before flushing to disk. The "in-memory" part is not
that important, and the buffer may be backed by mmapped temporary files (to
keep memory usage low). If we follow this model, a 10 GB data file (the default
split partition size) -> 5000 chunks -> 5000 * 8 bytes = 40 KB of overhead for
offsets.
3) The inverted index buffer has two parts:
i) Bitmap offset buffer -> This is deterministic (8 bytes * number of
mappers). For 10k mappers -> 80 KB.
ii) Actual serialized bitmaps -> Proportional to the number of mappers. I
don't have a good way of estimating this yet, but I've seen that the
serialized size of even 10,000 bitmaps is pretty low (< 2 MB). Let me add some
more benchmarks and numbers.
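The deterministic parts of the estimate above can be checked with a few lines of arithmetic (the 2 MB chunk size, 10 GB split size, and 10k mapper count are the example values from this comment, not fixed constants):

```java
// Worked check of the per-file overhead estimates: one 8-byte offset per
// chunk, and one 8-byte bitmap offset per mapper, plus a fixed 20-byte header.
class IndexOverhead {
    static long chunkOffsetsBytes(long fileBytes, long chunkBytes) {
        long numChunks = fileBytes / chunkBytes;   // 10 GB / 2 MB -> 5000 chunks
        return numChunks * 8;                      // 8-byte offset per chunk
    }

    static long bitmapOffsetBytes(int numMappers) {
        return numMappers * 8L;                    // 8-byte offset per mapper
    }

    public static void main(String[] args) {
        long header = 8 + 8 + 4;                                         // 20 bytes
        System.out.println(chunkOffsetsBytes(10_000_000_000L, 2_000_000L)); // 40000 (~40 KB)
        System.out.println(bitmapOffsetBytes(10_000));                      // 80000 (~80 KB)
        System.out.println(header);                                         // 20
    }
}
```

Only the serialized bitmap sizes are data-dependent; everything else is fixed once the chunk size and mapper count are known.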
Another aspect I'm trying to improve is the need to synchronize the actual
[writeBufferToFile](https://github.com/apache/celeborn/pull/3279/files#diff-48761c959e94314b53865866b33156b04a97461d2860da333711c6fbef865fbdR108)
method, since the exact offset and chunk id are needed during a buffer flush.
The fact that we buffer into an in-memory chunk buffer before flushing helps a
little, but I'm working on improving this further.
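One possible direction, sketched here under my own assumptions (the names are illustrative, not the PR's actual API): hold the lock only long enough to reserve an (offset, chunk id) pair for the buffer, then perform the write outside the lock with a positional write, which `FileChannel` supports safely from multiple threads.

```java
// Hypothetical sketch: shrink the critical section of the flush path to
// metadata reservation only; the bulk I/O then happens without the lock.
class OffsetReservingWriter {
    private long nextOffset = 0;
    private int nextChunkId = 0;

    static final class Reservation {
        final long offset;
        final int chunkId;
        Reservation(long offset, int chunkId) { this.offset = offset; this.chunkId = chunkId; }
    }

    // Short critical section: record the chunk id and bump the file offset.
    synchronized Reservation reserve(int bufferLength) {
        Reservation r = new Reservation(nextOffset, nextChunkId++);
        nextOffset += bufferLength;
        return r;
    }

    // The caller then writes at r.offset without holding the lock, e.g.:
    //   channel.write(buffer, r.offset);  // FileChannel positional write
}
```

The trade-off is that chunks may land on disk out of reservation order, so any reader or finalizer that assumes sequential completion would need to wait for outstanding writes.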