zbbkeepgoing commented on issue #9351: URL: https://github.com/apache/hudi/issues/9351#issuecomment-1687907132
> @zbbkeepgoing Ideally, Delta and Hudi should be scanning a similar number of files if both are skipping files based on column stats. Can you confirm whether Hudi is reading all the files under the time partition column you are using in the query? It may be that Hudi is not skipping files using column stats at all and is just reading every file in the partition.
>
> We can also sync up on a huddle on the Hudi community Slack in case you want to look together. Ping me (Aditya Goenka).

Thank you for your reply. Here is an update with the latest test results.

Because "hoodie.clustering.plan.strategy.max.bytes.per.group" was set to a small value in our configuration, clustering created multiple file groups within each partition. As a result, Hudi scanned more files than Delta.

After the change, querying the clustered Hudi data is approximately 30x faster than querying the original data, while Delta is around 50x faster. So Hudi still lags significantly behind Delta.

We ran the same SQL multiple times on both Hudi and Delta and found that Hudi is approximately 1.4-1.6s slower than Delta. In the DAG, we also observed that Hudi's metadata time is around 1.7s, while Delta's metadata time is 0s. Delta does, however, have a separate job that scans the Delta log, which takes approximately 0.3 seconds.

I hope this information is helpful to you.

- Hudi
- Delta
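To illustrate why a small "hoodie.clustering.plan.strategy.max.bytes.per.group" leads to more files being scanned, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which a partition's data is split into groups of at most that many bytes and each group produces at least one output file. The function name and the sizes are hypothetical and are not taken from our actual tables or from Hudi's code; the real planner also depends on other clustering settings.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_clustering_groups(partition_bytes: int, max_bytes_per_group: int) -> int:
    """Lower-bound estimate of clustering groups for one partition.

    Simplified model (an assumption, not Hudi's planner): the partition's
    data is packed into groups of at most `max_bytes_per_group` bytes.
    """
    if partition_bytes < 0:
        raise ValueError("partition_bytes must be non-negative")
    if max_bytes_per_group <= 0:
        raise ValueError("max_bytes_per_group must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-partition_bytes // max_bytes_per_group)


# Hypothetical 10 GiB partition with a small vs. a larger group limit.
small_limit = estimate_clustering_groups(10 * GIB, 256 * MIB)
large_limit = estimate_clustering_groups(10 * GIB, 2 * GIB)
print(small_limit, large_limit)
```

Under this model, the smaller limit yields several times as many groups for the same partition, so a query that cannot prune within the partition has to open correspondingly more files.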
