[jira] [Updated] (HUDI-2550) Add support to configure no of small files to consider with MOR

Raymond Xu (Jira) Wed, 20 Oct 2021 22:36:07 -0700


     [ https://issues.apache.org/jira/browse/HUDI-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Raymond Xu updated HUDI-2550:
-----------------------------
    Description: 
In MOR tables, when the configured index cannot index log files (which is 
the case for all out-of-the-box indexes in Hudi), we choose only the smallest 
parquet file for every commit. The idea is that, over time, every file will 
grow until it is full. In other words, only one small file is bin-packed per 
commit, even though there could be more. 

Source: https://github.com/apache/hudi/blob/3354fac42f9a2c4dbc8ac73ca4749160e9b9459b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java#L66

 

We can add a config that controls the total number of files considered as 
small files for a MOR table when the index in use cannot index log files. 

We can leave the default value at 1 (current behavior), but interested users 
should be able to change it. 
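
The proposal can be illustrated with a minimal sketch. This is not Hudi's 
actual partitioner code: the BaseFile class, the pickSmallFiles method, and 
the maxCandidates parameter are hypothetical names used only for 
illustration. It assumes a file counts as "small" when its size is below a 
given limit, and it returns the smallest such files up to the configured 
count. With maxCandidates = 1 it reproduces the current behavior described 
above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; none of these names come from the Hudi code base.
final class SmallFileSelection {

    // Hypothetical stand-in for a base (parquet) file in a partition.
    static final class BaseFile {
        final String fileId;
        final long sizeBytes;

        BaseFile(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns up to maxCandidates files whose size is below smallFileLimitBytes,
    // smallest first. maxCandidates = 1 models the current behavior.
    static List<BaseFile> pickSmallFiles(List<BaseFile> baseFiles,
                                         long smallFileLimitBytes,
                                         int maxCandidates) {
        return baseFiles.stream()
                .filter(f -> f.sizeBytes < smallFileLimitBytes)
                .sorted(Comparator.comparingLong((BaseFile f) -> f.sizeBytes))
                .limit(Math.max(0, maxCandidates)) // guard against negative values
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<BaseFile> files = Arrays.asList(
                new BaseFile("f1", 10 * mb),
                new BaseFile("f2", 90 * mb),
                new BaseFile("f3", 40 * mb),
                new BaseFile("f4", 130 * mb)); // above the limit, never a candidate

        for (int limit = 1; limit <= 2; limit++) {
            String ids = pickSmallFiles(files, 100 * mb, limit).stream()
                    .map(f -> f.fileId)
                    .collect(Collectors.joining(", "));
            System.out.println("maxCandidates=" + limit + " -> " + ids);
        }
        // prints:
        // maxCandidates=1 -> f1
        // maxCandidates=2 -> f1, f3
    }
}
```

Keeping the default at 1 would leave existing tables unaffected, while a 
larger value would let a single commit bin-pack into several small files. 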

 

Original issue: https://github.com/apache/hudi/issues/3676 

 

 



> Add support to configure no of small files to consider with MOR
> ---------------------------------------------------------------
>
>                 Key: HUDI-2550
>                 URL: https://issues.apache.org/jira/browse/HUDI-2550
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical, user-support-issues



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
