[
https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-4766:
---------------------------------
Labels: pull-request-available (was: )
> Fix HoodieFlinkClusteringJob
> ----------------------------
>
> Key: HUDI-4766
> URL: https://issues.apache.org/jira/browse/HUDI-4766
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
>
> h1. Flink Hudi Clustering Issues
>
> # Integer type used for byte-size variables instead of long
> ** Limits the maximum representable size to 2^31-1 bytes (~2 GB)
> # Unable to choose a particular instant to execute
> # Unable to select a filter mode, as the method that controls it is
> overridden by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
> # No cleaning
> ** As in offline compaction (HoodieFlinkCompactor), cleaning is only
> enabled if _clean.async.enabled = false_.
> # Schedule configuration is not consistent with HoodieFlinkCompactor: the
> schedule flag defaults to false, the opposite of HoodieFlinkCompactor's default
> # Allow props to be passed in using _--props/--hoodie-conf_
> ** Required for passing in configurations like:
> *** _hoodie.parquet.compression.ratio_
> *** Partition filter configurations depending on strategy
> # Each clustering group will write out files of _hoodie.parquet.max.file.size_
> (120 MB by default)
> # Multiple clustering jobs can execute, but there is no fine-grained control
> over restarting jobs that have failed. The current implementation only filters
> for REQUESTED clustering jobs; rollbacks will never be performed.
> # Removed the unused _getNumberOfOutputFileGroups()_ function; the number of
> output file groups is instead determined by:
> ** _hoodie.clustering.plan.strategy.small.file.limit_
> ** _hoodie.clustering.plan.strategy.max.bytes.per.group_
> ** _hoodie.clustering.plan.strategy.target.file.max.bytes_
> ** Will create N file groups (one task writes to each file group,
> increasing parallelism)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)