zbbkeepgoing commented on issue #9351:
URL: https://github.com/apache/hudi/issues/9351#issuecomment-1687907132

   > @zbbkeepgoing Ideally delta and hudi should be scanning a similar number of files if both are skipping files via column stats. Can you confirm whether hudi is reading all the files under the time partition column you are using in the query? It may be that hudi is not skipping files using col stats at all and is just reading files from one entire partition.
   > 
   > we can also sync up on a huddle on hudi community slack in case you want 
to look together. Ping me (Aditya Goenka)
   
   
   Thank you for your reply. Here is an update with the latest test results.
   
   Because of the small value we had set for "hoodie.clustering.plan.strategy.max.bytes.per.group", clustering produced multiple file groups within each partition. As a result, Hudi scanned more files than Delta.
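
   For reference, these are the Hudi clustering size options involved (a sketch only; the byte values shown are the Hudi defaults, not the small value from our original configuration):

   ```properties
   # Max bytes per file group produced by a clustering plan.
   # Too small a value splits each partition into many file groups,
   # which inflates the number of files scanned at query time.
   hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648

   # Target max size of each file written by clustering.
   hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
   ```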
   
   After fixing that, queries on the clustered Hudi data are now approximately 30x faster than on the original data, while Delta is around 50x faster. So Hudi still lags significantly behind Delta.
   
   We ran the same SQL multiple times on both Hudi and Delta, and found that Hudi is approximately 1.4-1.6s slower than Delta. Moreover, in the DAG we observed that Hudi's metadata time is around 1.7s, while Delta's metadata time is 0s; Delta does run a separate job to scan the Delta log, which takes approximately 0.3 seconds. I hope this information is helpful to you.
   
   - Hudi
   
   
![image](https://github.com/apache/hudi/assets/42572980/07fa4b97-e5d6-4576-b27d-ea5f551038b1)
   
   - Delta
   
   
![image](https://github.com/apache/hudi/assets/42572980/f776b449-ab28-4fb3-ba16-e525265759f4)
   
   

