[
https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-4766:
---------------------------------
Labels: pull-request-available (was: )
> Fix HoodieFlinkClusteringJob
> ----------------------------
>
> Key: HUDI-4766
> URL: https://issues.apache.org/jira/browse/HUDI-4766
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
>
> h1. Flink Hudi Clustering Issues
>
> # Integer type used for byte-size variables instead of long
> ** Limits the maximum representable size to 2^31-1 bytes (~2 GB)
> # Unable to choose a particular instant to execute
> # Unable to select a filter mode, as the method that controls it is
> overridden by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
> # No cleaning
> ** As in offline compaction (HoodieFlinkCompactor), cleaning is only
> enabled if _clean.async.enabled = false_.
> # Schedule configuration is not consistent with HoodieFlinkCompactor: the
> schedule flag defaults to false, the opposite of HoodieFlinkCompactor's default
> # Allow props to be passed in using _--props/--hoodie-conf_
> ** Required for passing in configurations like:
> *** _hoodie.parquet.compression.ratio_
> *** Partition filter configurations depending on strategy
> # Each clustering group will write out files of _hoodie.parquet.max.file.size_
> (120 MB by default)
> # Multiple clustering jobs can execute, but there is no fine-grained control
> over restarting jobs that have failed. The current implementation only filters
> for REQUESTED clustering jobs; rollbacks will never be performed.
> # Removed the unused _getNumberOfOutputFileGroups()_ function; the number of
> output file groups is instead determined by:
> ** _hoodie.clustering.plan.strategy.small.file.limit_
> ** _hoodie.clustering.plan.strategy.max.bytes.per.group_
> ** _hoodie.clustering.plan.strategy.target.file.max.bytes_
> ** Will create N file groups (one task writes to each file group,
> increasing parallelism)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)