[
https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-5496:
---------------------------------
Labels: pull-request-available (was: )
> Prevent Hudi from generating clustering plans with filegroups consisting of
> only 1 fileSlice
> --------------------------------------------------------------------------------------------
>
> Key: HUDI-5496
> URL: https://issues.apache.org/jira/browse/HUDI-5496
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Priority: Major
> Labels: pull-request-available
>
> Suppose a partition is no longer being written to or updated, i.e. there
> will be no further changes to the partition, so the sizes of its parquet
> files will never change.
>
> If the parquet files in the partition (even after a prior clustering) are
> smaller than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, the
> fileSlice will always be returned as a candidate by
> {_}getFileSlicesEligibleForClustering(){_}.
>
> This may cause inputGroups with only 1 fileSlice to be selected as
> candidates for clustering. An example of a clusteringPlan demonstrating such
> a case is shown in JSON format below.
>
>
> {code:java}
> {
>   "inputGroups": [
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
>           "partitionPath": "dt=2023-01-03",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 260.0,
>         "TOTAL_IO_READ_MB": 130.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 130.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     },
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
>           "deltaFilePaths": [],
>           "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         },
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 418.0,
>         "TOTAL_IO_READ_MB": 209.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 209.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     }
>   ],
>   "strategy": {
>     "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
>     "strategyParams": {},
>     "version": 1
>   },
>   "extraMetadata": {},
>   "version": 1,
>   "preserveHoodieMetadata": true
> }{code}
>
> Such a case causes performance issues, as the same parquet file is rewritten
> unnecessarily on every clustering run (write amplification).
>
> The fix is to select only inputGroups with more than 1 fileSlice as
> candidates for clustering.
>
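The proposed filter can be sketched in plain Java as below. This is a simplified, hypothetical illustration, not Hudi's actual API: `InputGroup` and `dropSingleSliceGroups` are stand-in names. The idea is that a group containing a single fileSlice cannot merge anything, so it is dropped before the clustering plan is built, while multi-slice groups are kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ClusteringPlanSketch {

    // Simplified stand-in for a clustering input group: just the fileIds
    // of the fileSlices that would be rewritten together.
    static class InputGroup {
        final List<String> sliceFileIds;
        InputGroup(List<String> sliceFileIds) {
            this.sliceFileIds = sliceFileIds;
        }
    }

    // Proposed fix: keep only groups whose rewrite can actually merge
    // files, i.e. groups with more than one fileSlice. A single-slice
    // group would just rewrite one parquet file into an identical one.
    static List<InputGroup> dropSingleSliceGroups(List<InputGroup> groups) {
        return groups.stream()
                .filter(g -> g.sliceFileIds.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Mirrors the plan above: dt=2023-01-03 contributes one slice,
        // dt=2023-01-04 contributes two.
        InputGroup lone = new InputGroup(
                Arrays.asList("cf2929a7-78dc-4e99-be0c-926e9487187d-0"));
        InputGroup pair = new InputGroup(Arrays.asList(
                "b101162e-4813-4de6-9881-4ee0ff918f32-0",
                "9b1c1494-2a58-43f1-890d-4b52070937b1-0"));

        List<InputGroup> candidates =
                dropSingleSliceGroups(Arrays.asList(lone, pair));
        // Only the two-slice group from dt=2023-01-04 remains a candidate.
        System.out.println(candidates.size());
    }
}
```

Applied to the plan above, the `dt=2023-01-03` group (one slice) would be discarded and only the `dt=2023-01-04` group (two slices) would be clustered, avoiding the redundant rewrite.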
--
This message was sent by Atlassian Jira
(v8.20.10#820010)