voon created HUDI-4766:
--------------------------
Summary: Fix HoodieFlinkClusteringJob
Key: HUDI-4766
URL: https://issues.apache.org/jira/browse/HUDI-4766
Project: Apache Hudi
Issue Type: Bug
Reporter: voon
Assignee: voon
h1. Flink Hudi Clustering Issues
# Integer type is used for byte-size variables instead of long
** This caps the representable size at 2^31-1 bytes (~2 GiB)
# Unable to choose a particular instant to execute
# Unable to select a filter mode, as the method that controls it is overridden
by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
# No cleaning
** With reference to OfflineCompaction (HoodieFlinkCompactor), cleaning is
only triggered when _clean.async.enabled = false_
# Schedule configuration is inconsistent: HoodieFlinkClusteringJob defaults the
schedule flag to false, the opposite of HoodieFlinkCompactor
# Allow props to be passed in using _--props/--hoodie-conf_
** Required for passing in configurations like:
*** _hoodie.parquet.compression.ratio_
*** Partition filter configurations depending on strategy
# Clustering groups will emit files sized by _hoodie.parquet.max.file.size_
(120 MB by default)
# Multiple clustering jobs can execute, but there is no fine-grained control
over restarting failed jobs. The current implementation only filters for
REQUESTED clustering instants; rollbacks are never performed.
# Removed the unused _getNumberOfOutputFileGroups()_ function; output file
groups are instead governed by:
** _hoodie.clustering.plan.strategy.small.file.limit_
** _hoodie.clustering.plan.strategy.max.bytes.per.group_
** _hoodie.clustering.plan.strategy.target.file.max.bytes_
** These will create N file groups (one task writing to each file group,
increasing parallelism)
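The int-vs-long problem in item 1 can be reproduced with a minimal, self-contained Java snippet (illustrative only, not Hudi code): casting a byte count above 2^31-1 to int silently wraps negative, so any size arithmetic done in int misbehaves once files or groups exceed ~2 GiB.

```java
public class ByteSizeOverflow {
    public static void main(String[] args) {
        // A 3 GiB byte count is perfectly representable as a long...
        long targetFileSizeBytes = 3L * 1024 * 1024 * 1024;
        // ...but narrowing it to int wraps past 2^31-1 and goes negative.
        int truncated = (int) targetFileSizeBytes;
        System.out.println(targetFileSizeBytes); // prints 3221225472
        System.out.println(truncated);           // prints -1073741824
    }
}
```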
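As a hedged illustration of the last item, the number of output file groups for a clustering group can be derived from the group's total bytes and the target file size via a ceiling division. This is a sketch, not Hudi's actual implementation; _outputFileGroups_ is a hypothetical helper, and all sizes are kept as long to avoid the overflow issue from item 1.

```java
public class FileGroupSizing {
    // Hypothetical helper: ceiling division over the target file size gives
    // N output file groups, with one write task per group (parallelism = N).
    static int outputFileGroups(long totalGroupBytes, long targetFileMaxBytes) {
        return (int) ((totalGroupBytes + targetFileMaxBytes - 1) / targetFileMaxBytes);
    }

    public static void main(String[] args) {
        long targetFileMaxBytes = 1024L * 1024 * 1024;      // 1 GiB target file size
        long totalGroupBytes = 10L * 1024 * 1024 * 1024;    // 10 GiB of small files
        System.out.println(outputFileGroups(totalGroupBytes, targetFileMaxBytes)); // prints 10
    }
}
```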
--
This message was sent by Atlassian Jira
(v8.20.10#820010)