voon created HUDI-4766:
--------------------------

             Summary: Fix HoodieFlinkClusteringJob
                 Key: HUDI-4766
                 URL: https://issues.apache.org/jira/browse/HUDI-4766
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: voon
            Assignee: voon


h1. Flink Hudi Clustering Issues

 
 # Integer type used for byte-size variables instead of long
 ** An int caps the representable size at 2^31-1 bytes (~2 GiB)
 # Unable to choose a particular instant to execute
 # Unable to select a filter mode, as the method that controls this is overridden 
by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
 # No cleaning
 ** With reference to OfflineCompaction (HoodieFlinkCompactor), cleaning is 
only enabled if _clean.async.enabled = false._
 # Schedule configuration is inconsistent: the flag is defined as false, which is 
the opposite of HoodieFlinkCompactor's default
 # Allow props to be passed in using _--props/--hoodie-conf_
 ** Required for passing in configurations like:
 *** _hoodie.parquet.compression.ratio_
 *** Partition filter configurations depending on strategy
 # Each clustering group will emit files of _hoodie.parquet.max.file.size_ 
(120 MB by default)
 # Multiple clustering jobs can execute, but there is no fine-grained control over 
restarting jobs that have failed. The current implementation only filters for 
REQUESTED clustering instants; rollbacks will never be performed.
 # Removed the unused _getNumberOfOutputFileGroups()_ function; the number of 
output file groups is instead determined by:
 ** _hoodie.clustering.plan.strategy.small.file.limit_
 ** _hoodie.clustering.plan.strategy.max.bytes.per.group_
 ** _hoodie.clustering.plan.strategy.target.file.max.bytes_
 ** Will create N file groups (1 task will be writing to each file group, 
increasing parallelism)
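The first item above can be illustrated with a minimal sketch (not Hudi code): accumulating byte sizes in an int silently wraps past 2^31-1 bytes, while widening to long gives the correct total.

```java
// Illustrative sketch, assuming a 120 MB max file size and 20 file groups.
// Class and variable names are hypothetical, not from Hudi.
public class ByteSizeOverflow {
    public static void main(String[] args) {
        int maxFileSizeBytes = 120 * 1024 * 1024; // 125_829_120, fits in int
        int fileGroups = 20;

        // int arithmetic wraps: 2_516_582_400 exceeds Integer.MAX_VALUE
        int intTotal = maxFileSizeBytes * fileGroups;

        // widening one operand to long before multiplying avoids the overflow
        long longTotal = (long) maxFileSizeBytes * fileGroups;

        System.out.println(intTotal);  // -1778384896 (overflowed)
        System.out.println(longTotal); // 2516582400
    }
}
```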

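How the three plan-strategy configs above could interact can be sketched as ceil division of a group's byte budget by the target file size; the method name and arithmetic here are illustrative assumptions, not Hudi's actual implementation.

```java
// Hypothetical sketch: deriving the output file-group count from
// hoodie.clustering.plan.strategy.max.bytes.per.group and
// hoodie.clustering.plan.strategy.target.file.max.bytes.
public class OutputFileGroups {
    // ceil(totalBytes / targetFileMaxBytes): one write task per output file group
    static int numberOfOutputFileGroups(long totalBytesInGroup, long targetFileMaxBytes) {
        return (int) ((totalBytesInGroup + targetFileMaxBytes - 1) / targetFileMaxBytes);
    }

    public static void main(String[] args) {
        long maxBytesPerGroup = 2L * 1024 * 1024 * 1024;  // 2 GiB group budget (assumed)
        long targetFileMaxBytes = 1024L * 1024 * 1024;    // 1 GiB target file size (assumed)
        // 2 GiB / 1 GiB -> 2 file groups, i.e. 2 parallel write tasks
        System.out.println(numberOfOutputFileGroups(maxBytesPerGroup, targetFileMaxBytes));
    }
}
```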


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
