[
https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon updated HUDI-4766:
-----------------------
Description:
h1. Flink Hudi Clustering Issues
# Integer type is used for byte-size variables instead of long
** Maximum representable size is 2^31-1 bytes (~2 GiB)
# Unable to choose a particular instant to execute
# Unable to select the filter mode, as the method that controls it is overridden
by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
# No cleaning
** With reference to offline compaction (HoodieFlinkCompactor), cleaning is
only enabled when _clean.async.enabled = false_
# The schedule configuration defines the flag as false by default, which is the
opposite of HoodieFlinkCompactor's behaviour and therefore inconsistent with it
# No ability to pass in props using _--props/--hoodie-conf_
** Required for passing in configurations like:
*** _hoodie.parquet.compression.ratio_
*** Partition filter configurations depending on strategy
# Each clustering group will write out files of _hoodie.parquet.max.file.size_
(120 MB by default)
# Multiple clustering jobs can execute, but there is no fine-grained control over
restarting jobs that have failed. The current implementation will only filter for
REQUESTED clustering instants; rollbacks will never be performed.
# Removed unused _getNumberOfOutputFileGroups()_ function.
** _hoodie.clustering.plan.strategy.small.file.limit_
** _hoodie.clustering.plan.strategy.max.bytes.per.group_
** _hoodie.clustering.plan.strategy.target.file.max.bytes_
** Will create N file groups (one task writes to each file group,
increasing parallelism)
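The overflow in point 1 can be illustrated with a minimal Java sketch (the variable names are illustrative, not taken from the Hudi codebase):

```java
public class ByteSizeOverflowDemo {
    public static void main(String[] args) {
        // An int caps byte sizes at 2^31-1 bytes, i.e. roughly 2 GiB.
        int maxIntBytes = Integer.MAX_VALUE;          // 2147483647

        // A hypothetical 3 GiB size fits in a long ...
        long threeGiB = 3L * 1024 * 1024 * 1024;      // 3221225472

        // ... but silently wraps to a negative value when narrowed to int,
        // which is why byte-size variables must be declared as long.
        int truncated = (int) threeGiB;

        System.out.println(maxIntBytes);              // 2147483647
        System.out.println(truncated);                // -1073741824
    }
}
```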
was:
h1. Flink Hudi Clustering Issues
# Integer type used for byte size variables instead of long
** Maximum size range of 2^31-1 bytes ~2 gigabytes
# Unable to choose a particular instant to execute
# Unable to select filter mode as the method that controls this is overridden
by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
# No cleaning
** With reference to OfflineCompaction (HoodieFlinkCompactor), cleaning is
only enabled if _clean.async.enabled = false._
# Schedule configuration is not consistent with HoodieFlinkCompactor defining
the flag = false, which is opposite of HoodieFlinkCompactor
# Allow props to be passed in using _--props/–hoodie-conf_
** Required for passing in configurations like:
*** _hoodie.parquet.compression.ratio_
*** Partition filter configurations depending on strategy
# Clustering group will spit out files of _hoodie.parquet.max.file.size_
(120MB by default)
# Multiple clustering jobs can execute, but no fine-grain control over
restarting jobs that have failed. Current implementation will only filter for
REQUESTED clustering jobs; rollbacks will never be performed.
# Removed unused _getNumberOfOutputFileGroups()_ function.
** _hoodie.clustering.plan.strategy.small.file.limit_
** _hoodie.clustering.plan.strategy.max.bytes.per.group_
** _hoodie.clustering.plan.strategy.target.file.max.bytes_
** Will create N file groups (1 task will be writing to each file group,
increasing parallelism)
> Fix HoodieFlinkClusteringJob
> ----------------------------
>
> Key: HUDI-4766
> URL: https://issues.apache.org/jira/browse/HUDI-4766
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)