zbbkeepgoing opened a new issue, #9351: URL: https://github.com/apache/hudi/issues/9351
**Describe the problem you faced**

- **Our scenario**: We have 700 million records in our original offline table, distributed across 10 partitions. Partition sizes vary from 10 GB to 200 GB. We plan to ingest this data into a data lake and test point-query performance after applying clustering.
- **Point query scenario**: The original table has a column called `vin`, which is used as a filter together with the time partition column for point queries.
- **Hudi configuration**:
```
# hoodie.clustering.plan.strategy.target.file.max.bytes is set to 1 GB, consistent with Delta Lake's default value.
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=vin
hoodie.clustering.rollback.pending.replacecommit.on.conflict=true
hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614
hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623
hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184
hoodie.clustering.plan.strategy.max.num.groups=128
hoodie.layout.optimize.enable=true
hoodie.layout.optimize.strategy=z-order
```
- **Phenomena we observed**:
  1. After clustering, both Hudi and Delta Lake produce Parquet files of approximately 1 GB, with a margin of around 200 MB.
  2. With clustering applied, point queries in Hudi scan around 10 files in the larger partitions, while Delta Lake typically scans only 1–2 files regardless of the partition.
  3. We ran performance tests at 1 and 10 concurrent queries, with hundreds of rounds on both Hudi and Delta Lake and different combinations of `vin` and time-partition values. The final conclusion: Delta Lake performs roughly three times better than Hudi.
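As an illustration of phenomenon 2, min/max-based file pruning can be modeled with a toy sketch (plain Python, not Hudi internals; the file names and `vin` ranges below are invented). If sorting happens independently inside several clustering groups, each group's files span nearly the full `vin` range, so their ranges overlap and a point lookup matches roughly one file per group; a single global sort yields disjoint ranges and matches one file:

```python
# Hypothetical model of column-statistics pruning (illustrative only, not Hudi code).
def candidate_files(file_stats, key):
    """Return files whose [min, max] vin range could contain the key."""
    return [f for f, (lo, hi) in file_stats.items() if lo <= key <= hi]

# Case 1: one global sort over the partition -> disjoint vin ranges per file.
global_sort = {"f0": (0, 99), "f1": (100, 199), "f2": (200, 299)}

# Case 2: sorting done independently per clustering group -> each group's file
# spans (nearly) the whole vin range, so ranges overlap across files.
per_group_sort = {"g0-f0": (0, 299), "g1-f0": (1, 298), "g2-f0": (0, 297)}

print(len(candidate_files(global_sort, 150)))     # 1
print(len(candidate_files(per_group_sort, 150)))  # 3
```

Whether this is what happens here depends on how Hudi lays out data across the planned clustering groups, which is part of the question above.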
After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min and max values) to retrieve candidate files. The file-listing logic itself is therefore unlikely to be the cause of the performance gap; the issue most likely lies in the clustering algorithm. Could you analyze, from a professional perspective, what the reason might be? The answer will determine which data lake technology we ultimately choose.

**Expected behavior**

Point-query performance after clustering is comparable to Delta Lake's.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.3
* Hive version : 2.3.9
* Hadoop version : 2.x
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
