[
https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon updated HUDI-4766:
-----------------------
Description:
h1. Flink Hudi Clustering Issues
# Integer type is used for byte-size variables instead of long
** Maximum representable size is 2^31-1 bytes (~2 GiB)
# Unable to choose a particular instant to execute
# Unable to select the filter mode, as the method that controls it is overridden
by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
# No cleaning
** With reference to offline compaction (HoodieFlinkCompactor), cleaning is
only enabled when _clean.async.enabled = false_
# The schedule configuration defines the flag as false by default, which is the
opposite of HoodieFlinkCompactor's behaviour and therefore inconsistent with it
# No ability to pass in props using _--props/--hoodie-conf_
** Required for passing in configurations like:
*** _hoodie.parquet.compression.ratio_
*** Partition filter configurations depending on strategy
# Each clustering group will write out files of _hoodie.parquet.max.file.size_
(120 MB by default)
# Multiple clustering jobs can execute, but there is no fine-grained control over
restarting jobs that have failed. The current implementation will only filter for
REQUESTED clustering instants; rollbacks will never be performed.
# Removed unused _getNumberOfOutputFileGroups()_ function.
** _hoodie.clustering.plan.strategy.small.file.limit_
** _hoodie.clustering.plan.strategy.max.bytes.per.group_
** _hoodie.clustering.plan.strategy.target.file.max.bytes_
** Will create N file groups (one task writes to each file group,
increasing parallelism)
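The overflow in point 1 can be illustrated with a minimal Java sketch (the variable names are illustrative, not taken from the Hudi codebase):

```java
public class ByteSizeOverflowDemo {
    public static void main(String[] args) {
        // An int caps byte sizes at 2^31-1 bytes, i.e. roughly 2 GiB.
        int maxIntBytes = Integer.MAX_VALUE;          // 2147483647

        // A hypothetical 3 GiB size fits in a long ...
        long threeGiB = 3L * 1024 * 1024 * 1024;      // 3221225472

        // ... but silently wraps to a negative value when narrowed to int,
        // which is why byte-size variables must be declared as long.
        int truncated = (int) threeGiB;

        System.out.println(maxIntBytes);              // 2147483647
        System.out.println(truncated);                // -1073741824
    }
}
```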
was:
h1. Flink Hudi Clustering Issues
# Integer type used for byte size variables instead of long
** Maximum size range of 2^31-1 bytes ~2 gigabytes
# Unable to choose a particular instant to execute
# Unable to select filter mode as the method that controls this is overridden
by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
# No cleaning
** With reference to OfflineCompaction (HoodieFlinkCompactor), cleaning is
only enabled if _clean.async.enabled = false._
# Schedule configuration is not consistent with HoodieFlinkCompactor defining
the flag = false, which is opposite of HoodieFlinkCompactor
# Allow props to be passed in using _--props/–hoodie-conf_
** Required for passing in configurations like:
*** _hoodie.parquet.compression.ratio_
*** Partition filter configurations depending on strategy
# Clustering group will spit out files of _hoodie.parquet.max.file.size_
(120MB by default)
# Multiple clustering jobs can execute, but no fine-grain control over
restarting jobs that have failed. Current implementation will only filter for
REQUESTED clustering jobs; rollbacks will never be performed.
# Removed unused _getNumberOfOutputFileGroups()_ function.
** _hoodie.clustering.plan.strategy.small.file.limit_
** _hoodie.clustering.plan.strategy.max.bytes.per.group_
** _hoodie.clustering.plan.strategy.target.file.max.bytes_
** Will create N file groups (1 task will be writing to each file group,
increasing parallelism)
> Fix HoodieFlinkClusteringJob
> ----------------------------
>
> Key: HUDI-4766
> URL: https://issues.apache.org/jira/browse/HUDI-4766
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)