sivabalan narayanan created HUDI-2550:
-----------------------------------------
Summary: Add support to configure the number of small files to consider
with MOR
Key: HUDI-2550
URL: https://issues.apache.org/jira/browse/HUDI-2550
Project: Apache Hudi
Issue Type: Improvement
Components: Writer Core
Reporter: sivabalan narayanan
It looks like in MOR tables, when the index in use cannot index log files
(which is the case for all out-of-the-box indexes in Hudi), we choose only the
smallest parquet file for every commit. The idea is that, over time, every
file will grow to its full size. In other words, only one small file is
bin-packed per commit, even though there could be more.
Source:
[link|https://github.com/apache/hudi/blob/3354fac42f9a2c4dbc8ac73ca4749160e9b9459b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java#L66]
We can add a config that controls the total number of files considered as
small files for a MOR table when the index in use cannot index log files.
We can leave the default value at 1 (the current behavior), but users who want
more small files bin-packed per commit should be able to raise it.
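
A minimal sketch of the proposed behavior, assuming small-file candidates are
base files below a size limit, picked smallest first. The names BaseFileInfo,
SmallFileSelectionSketch, pickSmallFiles and maxSmallFiles are hypothetical,
chosen only for illustration; they are not existing Hudi classes or configs,
and the real partitioner works on file slices rather than this simplified type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a base (parquet) file; not a Hudi class.
final class BaseFileInfo {
    final String fileId;
    final long sizeBytes;

    BaseFileInfo(String fileId, long sizeBytes) {
        this.fileId = fileId;
        this.sizeBytes = sizeBytes;
    }
}

final class SmallFileSelectionSketch {
    /**
     * Returns up to maxSmallFiles base files whose size is below
     * smallFileLimitBytes, ordered smallest first.
     * With maxSmallFiles = 1 this matches the behavior described in the
     * issue: only the single smallest file is picked per commit.
     */
    static List<BaseFileInfo> pickSmallFiles(List<BaseFileInfo> baseFiles,
                                             long smallFileLimitBytes,
                                             int maxSmallFiles) {
        if (maxSmallFiles <= 0) {
            // Assumption: a non-positive limit means no small files are considered.
            return new ArrayList<>();
        }
        return baseFiles.stream()
            .filter(f -> f.sizeBytes < smallFileLimitBytes)
            .sorted(Comparator.comparingLong((BaseFileInfo f) -> f.sizeBytes))
            .limit(maxSmallFiles)
            .collect(Collectors.toList());
    }
}
```

With a limit of 1 the sketch returns only the smallest qualifying file; raising
the limit lets more small files receive inserts in the same commit, at the cost
of touching more file groups per write.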
--
This message was sent by Atlassian Jira
(v8.3.4#803005)