[
https://issues.apache.org/jira/browse/HUDI-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
XixiHua updated HUDI-5250:
--------------------------
Description:
Currently, Hudi estimates the average record size from the records written in
previous commits. The estimate determines how many records are packed into one
file; the relevant code is UpsertPartitioner.averageBytesPerRecord().
However, we found that a single data file can grow to 600~700 MB while most
other files are smaller than 200 MB:
* Reason
** When totalBytesWritten/totalRecordsWritten for the last commit is very small
but the records in the next commit are very large, the resulting data files
become very large.
* Proposed solutions
** Plan1: calculate avgSize over the past several commits rather than only one.
However, getCommitMetadata is expensive, so this function might become slow, and
we did not choose this option.
** Plan2: using the
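
To make the problem concrete, here is a minimal Java sketch of the estimate
described above. It is not Hudi's actual code: the class name, the method
signatures, the byte-based threshold parameter and all numbers are assumptions
chosen for illustration. It only shows how a small average taken from the last
commit inflates the number of records packed into one file, and how a negative
threshold could fall back to a configured default, as the issue title suggests.

```java
// Illustrative sketch only; names and parameters are hypothetical, not Hudi's API.
final class RecordSizeEstimate {

    /**
     * Estimates bytes per record from the previous commit's totals.
     * Returns defaultRecordSize when the threshold is negative (estimation
     * treated as disabled), when no records were written, or when the commit
     * wrote too few bytes to be a trustworthy sample.
     */
    static long averageBytesPerRecord(long totalBytesWritten,
                                      long totalRecordsWritten,
                                      long estimationThresholdBytes,
                                      long defaultRecordSize) {
        if (estimationThresholdBytes < 0) {
            return defaultRecordSize;
        }
        if (totalRecordsWritten > 0 && totalBytesWritten > estimationThresholdBytes) {
            return (long) Math.ceil((double) totalBytesWritten / totalRecordsWritten);
        }
        return defaultRecordSize;
    }

    /** Number of records packed into one file of the target size. */
    static long recordsPerFile(long targetFileSizeBytes, long avgRecordSize) {
        // Guard against a zero or negative estimate.
        return targetFileSizeBytes / Math.max(1L, avgRecordSize);
    }
}
```

With these made-up numbers, a last commit of 100 MB over 1,000,000 records
gives an estimate of 105 bytes per record, so a 120 MB target file is planned
for 1,198,372 records. If the next commit's records are really about 600 bytes
each, that file ends up near 686 MB, which is the kind of skew reported above.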
> Using the default value of estimate record size at the
> averageBytesPerRecord() when estimation threshold is less than 0
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-5250
> URL: https://issues.apache.org/jira/browse/HUDI-5250
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: XixiHua
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)