zbbkeepgoing opened a new issue, #9351: URL: https://github.com/apache/hudi/issues/9351
**Describe the problem you faced**

- **Our scenario**: We have 700 million records in our original offline table, distributed across 10 partitions. Partition sizes vary from 10 GB to 200 GB. We plan to ingest this data into a data lake and test point-query performance after applying clustering.
- **Point query scenario**: The original table has a column called `vin`, which is used as a filter together with the time partition column for point queries.
- **Hudi configuration**:
```
# hoodie.clustering.plan.strategy.target.file.max.bytes is set to 1 GB, consistent with Delta Lake's default value.
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=vin
hoodie.clustering.rollback.pending.replacecommit.on.conflict=true
hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614
hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623
hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184
hoodie.clustering.plan.strategy.max.num.groups=128
hoodie.layout.optimize.enable=true
hoodie.layout.optimize.strategy=z-order
```
- **Phenomena we observed**:
  1. After clustering, both Hudi and Delta Lake produce Parquet files of approximately 1 GB, with a margin of around 200 MB.
  2. With clustering applied, point queries in Hudi scan around 10 files in the larger partitions, while Delta Lake typically scans only 1–2 files regardless of the partition.
  3. We ran performance tests at 1 and 10 concurrent queries, with hundreds of rounds on both Hudi and Delta Lake and different combinations of `vin` and time-partition values. The final conclusion: Delta Lake performs roughly three times better than Hudi.
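As an illustration of phenomenon 2, min/max-based file pruning can be modeled with a toy sketch (plain Python, not Hudi internals; the file names and `vin` ranges below are invented). If sorting happens independently inside several clustering groups, each group's files span nearly the full `vin` range, so their ranges overlap and a point lookup matches roughly one file per group; a single global sort yields disjoint ranges and matches one file:

```python
# Hypothetical model of column-statistics pruning (illustrative only, not Hudi code).
def candidate_files(file_stats, key):
    """Return files whose [min, max] vin range could contain the key."""
    return [f for f, (lo, hi) in file_stats.items() if lo <= key <= hi]

# Case 1: one global sort over the partition -> disjoint vin ranges per file.
global_sort = {"f0": (0, 99), "f1": (100, 199), "f2": (200, 299)}

# Case 2: sorting done independently per clustering group -> each group's file
# spans (nearly) the whole vin range, so ranges overlap across files.
per_group_sort = {"g0-f0": (0, 299), "g1-f0": (1, 298), "g2-f0": (0, 297)}

print(len(candidate_files(global_sort, 150)))     # 1
print(len(candidate_files(per_group_sort, 150)))  # 3
```

Whether this is what happens here depends on how Hudi lays out data across the planned clustering groups, which is part of the question above.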
After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min and max values) to retrieve candidate files. The file-listing logic itself is therefore unlikely to be the cause of the performance gap; the issue most likely lies in the clustering algorithm. Could you analyze, from a professional perspective, what the reason might be? The answer will determine which data lake technology we ultimately choose.

**Expected behavior**

Point-query performance after clustering is comparable to Delta Lake's.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.3
* Hive version : 2.3.9
* Hadoop version : 2.x
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
