[GitHub] [hudi] zbbkeepgoing commented on issue #9351: [SUPPORT] The point query performance after clustering is lags behind Delta Lake.

via GitHub Tue, 22 Aug 2023 03:17:17 -0700


zbbkeepgoing commented on issue #9351:
URL: https://github.com/apache/hudi/issues/9351#issuecomment-1687907132

> @zbbkeepgoing Ideally, Delta and Hudi should be scanning a similar number
> of files if both are skipping files based on column stats. Can you confirm
> whether Hudi is reading all the files under the time partition column you
> are using in the query? It may be that Hudi is not skipping files using
> column stats at all and is just reading every file in that one partition.
>
> We can also sync up in a huddle on the Hudi community Slack in case you
> want to look at this together. Ping me (Aditya Goenka).

Thank you for your reply. Here is an update with the latest test results.

Because we had set a small value for
"hoodie.clustering.plan.strategy.max.bytes.per.group" in our configuration,
clustering created multiple file groups within each partition. As a result,
Hudi scanned more files than Delta.
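To illustrate the effect, here is a rough sketch. It assumes clustering groups are formed only by capping the input bytes per group, which is a simplification (the actual planner also considers other settings such as the small-file limit). The partition size and the two candidate values are hypothetical, not our real numbers.

```python
import math

GiB = 1024 ** 3
MiB = 1024 ** 2


def estimated_groups(partition_bytes: int, max_bytes_per_group: int) -> int:
    """Rough estimate of clustering groups for one partition.

    Simplifying assumption: groups are cut purely by the bytes-per-group cap.
    """
    return max(1, math.ceil(partition_bytes / max_bytes_per_group))


# Hypothetical partition size, for illustration only.
partition_bytes = 4 * GiB

# A small cap splits the partition into many groups (more files to scan).
small_cap = estimated_groups(partition_bytes, 256 * MiB)

# A cap at least as large as the partition keeps it in a single group.
large_cap = estimated_groups(partition_bytes, 4 * GiB)

# The config value is passed as a string of bytes in the writer options.
clustering_options = {
    "hoodie.clustering.plan.strategy.max.bytes.per.group": str(4 * GiB),
}

print(small_cap, large_cap)
```

With a larger cap, each partition ends up in fewer groups, so a point query has fewer candidate files to open.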

However, after we adjusted the configuration, queries on the clustered Hudi
data are approximately 30x faster than on the original data, while Delta is
around 50x faster. Hudi therefore still lags significantly behind Delta.

We ran the same SQL multiple times on Hudi and Delta and found that Hudi is
approximately 1.4-1.6s slower than Delta. Moreover, in the DAG we observed
that Hudi's metadata time is around 1.7s, while Delta's metadata time is 0s.
Of course, Delta has a separate job to scan the Delta log, which takes
approximately 0.3 seconds. I hope this information is helpful to you.
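As a quick sanity check on these numbers, the small calculation below compares the metadata overhead on each side. It only uses the timings quoted above; the variable names are mine, and treating Delta's log-scan job as its metadata cost is an assumption.

```python
# Timings reported above, in seconds.
hudi_metadata_s = 1.7
delta_metadata_s = 0.0
delta_log_scan_s = 0.3  # assumption: counted as Delta's metadata cost

# Extra metadata time Hudi spends relative to Delta.
metadata_gap_s = round(hudi_metadata_s - (delta_metadata_s + delta_log_scan_s), 2)

# End-to-end gap observed across repeated runs, in seconds.
observed_gap_s = (1.4, 1.6)

# Does the metadata difference alone fall within the observed gap?
within_observed = observed_gap_s[0] <= metadata_gap_s <= observed_gap_s[1]

print(metadata_gap_s, within_observed)
```

If that assumption holds, the metadata difference is about the same size as the total gap, which suggests that most of the remaining difference may come from metadata handling rather than from file scanning.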

- Hudi

![image](https://github.com/apache/hudi/assets/42572980/07fa4b97-e5d6-4576-b27d-ea5f551038b1)

- Delta

![image](https://github.com/apache/hudi/assets/42572980/f776b449-ab28-4fb3-ba16-e525265759f4)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
