[
https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-5496:
---------------------------------
Labels: pull-request-available (was: )
> Prevent Hudi from generating clustering plans with filegroups consisting of
> only 1 fileSlice
> --------------------------------------------------------------------------------------------
>
> Key: HUDI-5496
> URL: https://issues.apache.org/jira/browse/HUDI-5496
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Priority: Major
> Labels: pull-request-available
>
> Suppose a partition is no longer being written to or updated, i.e. there
> will be no further changes to the partition, so the sizes of its parquet
> files will never change.
>
> If the parquet files in the partition (even after a prior clustering) are
> smaller than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, the
> fileSlice will always be returned as a candidate by
> {_}getFileSlicesEligibleForClustering(){_}.
>
> This may cause inputGroups with only 1 fileSlice to be selected as
> candidates for clustering. An example of a clusteringPlan demonstrating such
> a case is shown in JSON format below.
>
>
> {code:java}
> {
>   "inputGroups": [
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
>           "partitionPath": "dt=2023-01-03",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 260.0,
>         "TOTAL_IO_READ_MB": 130.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 130.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     },
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
>           "deltaFilePaths": [],
>           "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         },
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 418.0,
>         "TOTAL_IO_READ_MB": 209.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 209.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     }
>   ],
>   "strategy": {
>     "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
>     "strategyParams": {},
>     "version": 1
>   },
>   "extraMetadata": {},
>   "version": 1,
>   "preserveHoodieMetadata": true
> }{code}
>
> Such a case causes performance issues, as the same parquet file is rewritten
> unnecessarily on every clustering run (write amplification).
>
> The fix is to select only inputGroups with more than 1 fileSlice as
> candidates for clustering.
>
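The proposed filter can be sketched in plain Java as below. This is a simplified, hypothetical illustration, not Hudi's actual API: `InputGroup` and `dropSingleSliceGroups` are stand-in names. The idea is that a group containing a single fileSlice cannot merge anything, so it is dropped before the clustering plan is built, while multi-slice groups are kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ClusteringPlanSketch {

    // Simplified stand-in for a clustering input group: just the fileIds
    // of the fileSlices that would be rewritten together.
    static class InputGroup {
        final List<String> sliceFileIds;
        InputGroup(List<String> sliceFileIds) {
            this.sliceFileIds = sliceFileIds;
        }
    }

    // Proposed fix: keep only groups whose rewrite can actually merge
    // files, i.e. groups with more than one fileSlice. A single-slice
    // group would just rewrite one parquet file into an identical one.
    static List<InputGroup> dropSingleSliceGroups(List<InputGroup> groups) {
        return groups.stream()
                .filter(g -> g.sliceFileIds.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Mirrors the plan above: dt=2023-01-03 contributes one slice,
        // dt=2023-01-04 contributes two.
        InputGroup lone = new InputGroup(
                Arrays.asList("cf2929a7-78dc-4e99-be0c-926e9487187d-0"));
        InputGroup pair = new InputGroup(Arrays.asList(
                "b101162e-4813-4de6-9881-4ee0ff918f32-0",
                "9b1c1494-2a58-43f1-890d-4b52070937b1-0"));

        List<InputGroup> candidates =
                dropSingleSliceGroups(Arrays.asList(lone, pair));
        // Only the two-slice group from dt=2023-01-04 remains a candidate.
        System.out.println(candidates.size());
    }
}
```

Applied to the plan above, the `dt=2023-01-03` group (one slice) would be discarded and only the `dt=2023-01-04` group (two slices) would be clustered, avoiding the redundant rewrite.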
--
This message was sent by Atlassian Jira
(v8.20.10#820010)