VitoMakarevich commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402495885

   Thank you for looking into it! We have a few flows, but let me describe the 
one I'm debugging now. A Hudi table is populated by a streaming job (Spark); a 
second job (batch) runs every N hours and reads all updates since the previous 
offset by loading the Hudi snapshot and filtering on the Hudi commit time, from 
the previous offset to now. Our job has Bloom index range pruning enabled, our 
target data file size is 128 MB, and we are running a CDC workload.
   I debugged one particular run (from the S3 access logs, as you suggested; I 
had thought to do the same initially). It covered two commits (a clean and a 
commit). During that run (filtered by time), the job issued GET requests against 
296 unique files (likely including markers, commit/timeline files, data files, 
and so on) - about 30k GET requests in total, for what was actually 38k updates 
and 17k inserts. Since all the GET calls are range requests, I measured their 
sizes: of those ~30k requests, ~25k were under 100 KB and ~1.5k were 100-200 KB.
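   For anyone who wants to reproduce this measurement, below is a minimal 
sketch of how the size buckets can be computed from S3 server access logs. It 
is plain Python, not what I actually ran; it assumes the standard S3 access-log 
field layout (operation and "Bytes Sent" positions after quote-aware 
splitting), and the sample lines are made up:

```python
import shlex
from collections import Counter

def bucket_get_sizes(log_lines):
    """Bucket REST.GET.OBJECT requests by "Bytes Sent".

    Assumes the standard S3 server access-log layout: after quote-aware
    splitting (the bracketed timestamp becomes two tokens), the operation
    is field 7 and "Bytes Sent" is field 12.
    """
    buckets = Counter()
    for line in log_lines:
        fields = shlex.split(line)
        if len(fields) < 13 or fields[7] != "REST.GET.OBJECT":
            continue
        sent = fields[12]
        size = int(sent) if sent != "-" else 0  # "-" means no body sent
        if size < 100_000:
            buckets["<100KB"] += 1
        elif size < 200_000:
            buckets["100-200KB"] += 1
        else:
            buckets[">=200KB"] += 1
    return buckets

# Hypothetical sample log lines (abbreviated to the first 14 fields).
sample = [
    'owner bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 requester RID1 '
    'REST.GET.OBJECT f.parquet "GET /bucket/f.parquet HTTP/1.1" 206 - '
    '65536 134217728',
    'owner bucket [06/Feb/2019:00:00:39 +0000] 192.0.2.3 requester RID2 '
    'REST.GET.OBJECT f.parquet "GET /bucket/f.parquet HTTP/1.1" 206 - '
    '150000 134217728',
]
counts = bucket_get_sizes(sample)
# one request lands in each of the two smallest buckets
```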
   Let me know if you need any additional information. In the meantime I'll 
continue investigating and will post here any details I consider important.
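
   For completeness, a minimal sketch of the snapshot-and-filter step described 
above, in plain Python rather than Spark (the `_hoodie_commit_time` column is 
Hudi's standard metadata field; the record shape and offsets here are 
hypothetical). Hudi commit times are timestamp strings, so lexicographic 
comparison matches chronological order:

```python
def incremental_filter(records, prev_offset, now):
    """Keep records whose _hoodie_commit_time falls in (prev_offset, now]."""
    return [r for r in records
            if prev_offset < r["_hoodie_commit_time"] <= now]

rows = [
    {"id": 1, "_hoodie_commit_time": "20230101080000000"},
    {"id": 2, "_hoodie_commit_time": "20230101120000000"},
    {"id": 3, "_hoodie_commit_time": "20230101160000000"},
]
picked = incremental_filter(rows, "20230101100000000", "20230101170000000")
# only the records committed after the previous offset remain
```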


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
