VitoMakarevich commented on issue #7734: URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402495885
Thank you for looking into it! We have a few flows, but let me describe the one I'm debugging now. A Hudi table is populated by a streaming Spark job; a second batch job runs every N hours and reads all updates since the previous offset by loading a Hudi snapshot and applying a filter (the same Hudi condition, from the previous offset to now). The job has bloom index range pruning enabled, our target data file size is 128 MB, and the workload is CDC.

I debugged one particular run from the S3 logs (as you suggested; I had thought to do the same initially). The run covered 2 commits (a clean and a commit). During that run (filtered by time), the job issued GET requests against 296 unique files (probably all file types: markers, commit timeline, data, etc.), for roughly 30k GET requests in total, while the run factually contained 38k updates and 17k inserts. Since all of the GET calls are range requests, I calculated that of those ~30k requests, ~25k were under 100 KB in size and ~1.5k were 100-200 KB.

Let me know if you need any additional information. In the meantime I'll continue searching and will post here any details I consider important.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
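The size bucketing described in the comment above (counting how many range GETs fall under 100 KB, into 100-200 KB, etc.) can be sketched roughly like this. This is a minimal illustration, not the author's actual script; it assumes you have already extracted the bytes-returned value of each GET request from the S3 access logs into a list of integers:

```python
from collections import Counter

def bucket_request_sizes(sizes_bytes):
    """Bucket S3 range-GET response sizes into the coarse bins
    used in the analysis above (<100 KB, 100-200 KB, >=200 KB)."""
    buckets = Counter()
    for size in sizes_bytes:
        if size < 100 * 1024:
            buckets["<100KB"] += 1
        elif size < 200 * 1024:
            buckets["100-200KB"] += 1
        else:
            buckets[">=200KB"] += 1
    return dict(buckets)

# Example with three hypothetical request sizes:
print(bucket_request_sizes([50_000, 150_000, 300_000]))
```

Applied to the full list of request sizes from the logs, this yields the histogram quoted in the comment (~25k requests under 100 KB out of ~30k total), which points at many small bloom-filter/footer range reads rather than full data file scans.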
