zbbkeepgoing opened a new issue, #9351:
URL: https://github.com/apache/hudi/issues/9351

   **Describe the problem you faced**
   
   - Our scenario
   
We have 700 million records in our original offline table, distributed across 10 partitions, with partition sizes ranging from 10GB to 200GB. We plan to ingest this data into a data lake and test point query performance after applying Clustering.
   
   - Point query scenario
   
   The original table has a column called "vin," which will be used as a filter 
along with the time partition column for point queries.
   
   - Hudi configuration
   
   hoodie.clustering.plan.strategy.target.file.max.bytes is set to 1GB (1073741824 bytes), consistent with Delta Lake's default value.
   
   ```
   hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
   hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   hoodie.clustering.plan.strategy.sort.columns=vin
   hoodie.clustering.rollback.pending.replacecommit.on.conflict=true
   hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
   hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
   hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614
   hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623
   hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184
   hoodie.clustering.plan.strategy.max.num.groups=128
   hoodie.layout.optimize.enable=true
   hoodie.layout.optimize.strategy=z-order
   ```
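   For reference, these options could be wired into a Hudi write through the Spark datasource roughly as follows. This is a minimal sketch only: the table name, path, key fields, and the inline-clustering trigger are our assumptions for illustration, not settings taken from the job above.

```python
# Illustrative sketch: passing the clustering options above to a Hudi write
# via the Spark datasource. Table name, path, key fields, and the
# inline-clustering trigger are hypothetical placeholders.
hudi_options = {
    "hoodie.table.name": "vin_events",                         # hypothetical
    "hoodie.datasource.write.recordkey.field": "vin",          # hypothetical
    "hoodie.datasource.write.partitionpath.field": "part_dt",
    "hoodie.clustering.inline": "true",   # or schedule clustering asynchronously
    # Options from the report above:
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",  # 1GB
    "hoodie.clustering.execution.strategy.class":
        "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
    "hoodie.clustering.plan.strategy.sort.columns": "vin",
    "hoodie.layout.optimize.enable": "true",
    "hoodie.layout.optimize.strategy": "z-order",
}

# df.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/table")
```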
   
   - Phenomena we observed
   
   1. After Clustering, both Hudi and Delta Lake produce Parquet files of approximately 1GB, varying by around 200MB.
   
   2. With Clustering applied, point queries in Hudi scan around 10 files in the larger partitions, while Delta Lake typically scans only 1-2 files regardless of partition size.
   
   3. We ran hundreds of rounds of performance tests on both Hudi and Delta Lake, at concurrency levels of 1 and 10 and with different combinations of "vin" and time-partition filter values. The final result was that Delta Lake's point queries were roughly three times faster than Hudi's.
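   To make observation 2 concrete, here is a small, self-contained simulation, not Hudi or Delta code, of how min/max pruning on a single column behaves when records are laid out by a linear sort on "vin" versus a two-column Z-order curve over (vin, ts). The record count, file count, and 16-bit value ranges are made-up assumptions:

```python
import random

def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-value: interleave the bits of x and y (bit-interleaving Z-order)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

random.seed(7)
# Hypothetical dataset: (vin, ts) pairs drawn uniformly from 16-bit ranges.
records = [(random.randrange(1 << 16), random.randrange(1 << 16))
           for _ in range(100_000)]

def files_touched(sort_key, target_vin: int, num_files: int = 100) -> int:
    """Sort the records, split them into equal-size 'files', and count how
    many files a point filter on vin still has to read after min/max pruning."""
    ordered = sorted(records, key=sort_key)
    size = len(ordered) // num_files
    touched = 0
    for f in range(num_files):
        vins = [v for v, _ in ordered[f * size:(f + 1) * size]]
        if min(vins) <= target_vin <= max(vins):  # file survives pruning
            touched += 1
    return touched

target = records[0][0]
linear = files_touched(lambda r: r[0], target)
zorder = files_touched(lambda r: interleave_bits(r[0], r[1]), target)
print(f"linear sort on vin: {linear} files; z-order(vin, ts): {zorder} files")
```

In this toy setup the linear layout confines each vin value to one or two files, while the Z-order layout tends to spread it across roughly the square root of the file count, which is consistent with the ~10-file scans we observed; when queries filter on only one column, a linear sort on that column is generally the stronger layout.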
   
   After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min and max values) to retrieve candidate files. We therefore believe the file-listing logic itself is unlikely to be the cause of the performance gap; the issue most likely lies in the Clustering algorithm itself.
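   The candidate-file retrieval described above boils down to a range check against each file's column statistics. An illustrative sketch of the principle (not Hudi's actual code; the file names and statistics are hypothetical):

```python
def candidate_files(file_stats: dict, vin: str) -> list:
    """Keep only files whose [min, max] range for the filter column
    could possibly contain the queried value (min/max pruning)."""
    return [name for name, (lo, hi) in file_stats.items() if lo <= vin <= hi]

# Hypothetical per-file statistics for the "vin" column:
stats = {
    "file-0.parquet": ("VIN00000", "VIN19999"),
    "file-1.parquet": ("VIN20000", "VIN39999"),
    "file-2.parquet": ("VIN00000", "VIN99999"),  # wide range: poor clustering
}
print(candidate_files(stats, "VIN25000"))  # → ['file-1.parquet', 'file-2.parquet']
```

The pruning itself is simple; what determines how many candidates survive is how tightly the clustering layout packs each vin into few files, which is why we suspect the layout rather than the listing logic.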
   
   Could you please analyze, from a professional perspective, the reason behind this? The answer will determine which data lake technology we ultimately choose.
   
   **Expected behavior**
   
   Point query performance after clustering should be comparable to Delta Lake's.
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.3
   
   * Hive version :  2.3.9
   
   * Hadoop version :  2.x
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   

