gudladona commented on PR #18241:
URL: https://github.com/apache/hudi/pull/18241#issuecomment-3961255857

   > Hi @gudladona Thanks for this contribution! In general, there are two 
questions that I wonder if you could elaborate on:
   > 
   > 1. Whole-File In-Memory Processing: Implemented a "Read Whole File" 
strategy for files smaller than 2GB. Do we need to cache the entire file here, 
or is IO at the fg granularity sufficient? This is mainly a consideration of 
memory pressure.
   
   I'll assume you mean row group, not file group (fg). I cached the entire file 
because the full file has to be read regardless; in the previous implementation, 
reading at column-chunk granularity put enormous pressure on the s3a client, 
causing timeouts. Caching the whole file reduces thousands of ranged GETs to a 
few S3 operations, depending on the file size. I considered implementing this at 
the row-group level, but then there would be additional range GETs for the file 
metadata, which is IO that has to happen anyway. So my thinking is that a single 
whole-object GET amortizes its cost across the entire operation.
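   To illustrate the tradeoff (this is a hypothetical sketch with a stand-in object store, not Hudi's actual reader or the s3a client): a per-column-chunk reader issues one ranged GET per chunk, while the whole-file strategy issues a single GET and serves every chunk from the in-memory buffer.

```python
class CountingStore:
    """Stand-in for an S3-like object store that counts GET requests."""

    def __init__(self, data: bytes):
        self.data = data
        self.get_count = 0

    def get_range(self, start: int, length: int) -> bytes:
        self.get_count += 1
        return self.data[start:start + length]

    def get_whole(self) -> bytes:
        self.get_count += 1
        return self.data


def read_per_chunk(store, chunk_ranges):
    # One ranged GET per column chunk (plus footer reads in practice).
    return [store.get_range(s, n) for s, n in chunk_ranges]


def read_whole_file(store, chunk_ranges):
    # Single GET; chunks are sliced out of the cached buffer.
    buf = store.get_whole()
    return [buf[s:s + n] for s, n in chunk_ranges]


if __name__ == "__main__":
    data = bytes(range(256)) * 16               # 4 KiB fake file
    ranges = [(i * 64, 64) for i in range(64)]  # 64 fake "column chunks"

    a = CountingStore(data)
    read_per_chunk(a, ranges)
    b = CountingStore(data)
    read_whole_file(b, ranges)
    print(a.get_count, b.get_count)             # 64 requests vs 1
```

With thousands of column chunks per file, the request count (and the s3a connection-pool pressure) scales the same way.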
   
   > 2. Double-Buffer: Do we definitely need this Double-Buffer? For binary 
copy, the CPU pressure itself is relatively low, and the overall bottleneck 
lies in the IO interaction with remote storage. It seems that using a double 
buffer for caching here is not of great practical significance.
   
   The double buffer is an optimization that lets a background thread keep the 
next file "ready," since the copy operation itself is not IO-bound. When the 
source files are large, with multiple row groups and thousands of column chunks 
(true in our case), this concurrency helped squeeze out additional performance.
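   The overlap can be sketched roughly like this (hypothetical `fetch`/`copy` stand-ins for the remote read and the binary copy, not Hudi's actual code): while the current buffer is being copied, a single background worker fetches the next file.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(name):
    # Stand-in for a remote whole-file read (the IO-bound step).
    return f"bytes-of-{name}".encode()


def copy(buf, sink):
    # Stand-in for the CPU-light binary copy of the fetched bytes.
    sink.append(buf)


def double_buffered_copy(files, sink):
    with ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(fetch, files[0])
        for i in range(len(files)):
            buf = nxt.result()                        # wait for prefetched file
            if i + 1 < len(files):
                nxt = pool.submit(fetch, files[i + 1])  # prefetch overlaps copy
            copy(buf, sink)


out = []
double_buffered_copy(["part-1.parquet", "part-2.parquet"], out)
```

The net effect is that fetch latency for file N+1 is hidden behind the copy of file N, at the cost of holding up to two files in memory at once.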


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
