sivabalan narayanan created HUDI-2550:
-----------------------------------------

             Summary: Add support to configure no of small files to consider 
with MOR
                 Key: HUDI-2550
                 URL: https://issues.apache.org/jira/browse/HUDI-2550
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Writer Core
            Reporter: sivabalan narayanan


In MOR, when the index in use cannot index log files (which is the case for 
all out-of-the-box indexes in Hudi), we choose only the smallest parquet file 
for every commit. The idea is that, over time, every file will grow to its 
full size. In other words, only one small file is bin packed per commit, even 
though there could be more. 

source 
[link|https://github.com/apache/hudi/blob/3354fac42f9a2c4dbc8ac73ca4749160e9b9459b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java#L66]

 

We can add a config that controls the total number of files considered as 
small files for a MOR table when an index that cannot index log files is used. 

We can keep the default value at 1 (current behavior), but interested users 
should be able to raise it. 
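
A minimal sketch of the proposed behavior, not the actual Hudi code: the 
class SmallFileSelector, the BaseFile type, and the parameter maxSmallFiles 
are hypothetical names made up for illustration, and the size threshold is 
assumed to play the role of the existing small-file limit. It sorts the 
candidate base files by size and keeps up to maxSmallFiles of them, so a 
value of 1 reproduces the current "smallest file only" behavior. 

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SmallFileSelector {

    // Hypothetical stand-in for a base (parquet) file in one partition.
    static final class BaseFile {
        final String fileId;
        final long sizeBytes;

        BaseFile(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns up to maxSmallFiles base files whose size is below
    // smallFileLimitBytes, smallest first. maxSmallFiles is the assumed
    // new config value; 1 mimics the current behavior.
    static List<BaseFile> selectSmallFiles(List<BaseFile> baseFiles,
                                           long smallFileLimitBytes,
                                           int maxSmallFiles) {
        if (maxSmallFiles < 1) {
            throw new IllegalArgumentException("maxSmallFiles must be >= 1");
        }
        return baseFiles.stream()
                .filter(f -> f.sizeBytes < smallFileLimitBytes)
                .sorted(Comparator.comparingLong((BaseFile f) -> f.sizeBytes))
                .limit(maxSmallFiles)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<BaseFile> files = Arrays.asList(
                new BaseFile("f1", 90 * mb),
                new BaseFile("f2", 10 * mb),
                new BaseFile("f3", 40 * mb),
                new BaseFile("f4", 200 * mb));

        // Default of 1: only the smallest file is a bin-packing candidate.
        for (BaseFile f : selectSmallFiles(files, 100 * mb, 1)) {
            System.out.println("limit=1 -> " + f.fileId);
        }
        // A higher value lets more small files receive inserts per commit.
        for (BaseFile f : selectSmallFiles(files, 100 * mb, 2)) {
            System.out.println("limit=2 -> " + f.fileId);
        }
    }
}
```

With a limit of 1 only f2 is selected; with 2, f2 and f3 are selected, while 
f4 is never a candidate because it is above the threshold. 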


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
