jklim96 commented on issue #7254: URL: https://github.com/apache/hudi/issues/7254#issuecomment-1325826748
For the 10-commit query where the incremental query took longer, I manually checked the files and confirmed that ~1600 of ~14000 total files were touched. The incremental query should therefore theoretically be ~9x faster than the filter query, yet we're seeing incremental queries perform worse than the filter.

From the [Hudi documentation](https://hudi.apache.org/docs/faq/#what-performance-can-i-expect-for-hudi-readingqueries):

> For incremental views, you can expect speed up relative to how much data usually changes in a given time window and how much time your entire scan takes. For e.g, if only 100 files changed in the last hour in a partition of 1000 files, then you can expect a speed of 10x using incremental pull in Hudi compared to full scanning the partition to find out new data.

The behaviour we're seeing doesn't match what the documentation describes.
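
To make the ~9x figure explicit, here is a minimal sketch of the arithmetic. It assumes, as the FAQ example does, that scan cost scales linearly with the number of files read. The function name `expected_speedup` is hypothetical and is not part of Hudi.

```python
def expected_speedup(total_files: int, changed_files: int) -> float:
    """Idealized speedup of an incremental read over a full scan.

    Assumption: scan cost is proportional to the number of files read,
    which is the simplification behind the FAQ's 10x example.
    """
    if changed_files <= 0:
        raise ValueError("changed_files must be positive")
    return total_files / changed_files


# FAQ example: 100 of 1000 files changed.
print(expected_speedup(1000, 100))    # 10.0

# Our table: ~1600 of ~14000 files touched across the 10 commits.
print(expected_speedup(14000, 1600))  # 8.75
```

This is only an idealized upper bound under the linear-cost assumption; it ignores any fixed per-query overhead.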
