jtmzheng commented on issue #2470: URL: https://github.com/apache/hudi/issues/2470#issuecomment-769948718
Sorry for the delay. I believe the slowness was because compaction wasn't keeping up with the number of files: we partition by date, and many partitions receive a small number of updates while most writes land in the current date, so file count was growing faster than dataset size. I saw that compaction is IO-bounded and that selection is based on log file size, so the smaller updates were never getting compacted. I've since increased the IO bound 3x, and performance is slowly improving as the file count goes down (i.e. the "getting small files" stage is faster). I'll update once we test rollback performance, but I suspect it will also be better.
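
To illustrate the starvation effect described above, here is a toy model in Python. It is not Hudi's code: the `FileGroup` class, the cost formula (log size plus base file size), and all the sizes are made up for illustration. The comment doesn't name the setting that was changed; my assumption is that it is `hoodie.compaction.target.io` (in MB) together with the log-file-size-based compaction strategy.

```python
from dataclasses import dataclass


@dataclass
class FileGroup:
    """Hypothetical stand-in for a file group with pending log files."""
    partition: str
    log_mb: int   # total size of un-compacted log files
    base_mb: int  # size of the base file that compaction must rewrite


def select_for_compaction(groups, io_budget_mb):
    """Pick file groups largest-log-first until the IO budget is used up.

    Simplified model (an assumption, not Hudi's actual implementation):
    the cost of compacting a group is log_mb + base_mb.
    """
    selected, spent = [], 0
    for group in sorted(groups, key=lambda g: g.log_mb, reverse=True):
        if spent >= io_budget_mb:
            break  # budget exhausted; everything smaller is skipped
        selected.append(group)
        spent += group.log_mb + group.base_mb
    return selected


# Current-date partition: few groups with large logs.
# Older partitions: many groups, each with a small trickle of updates.
groups = (
    [FileGroup("today", log_mb=200, base_mb=100) for _ in range(3)]
    + [FileGroup(f"old-{i}", log_mb=5, base_mb=100) for i in range(5)]
)

tight = select_for_compaction(groups, io_budget_mb=600)
relaxed = select_for_compaction(groups, io_budget_mb=1800)  # 3x the bound

print([g.partition for g in tight])
print([g.partition for g in relaxed])
```

In this model the tight budget is spent entirely on the current-date groups on every run, so the small log files in older partitions keep accumulating. Raising the bound leaves room for them after the large groups are handled, which would match the file count slowly coming down.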
