[
https://issues.apache.org/jira/browse/HUDI-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-5250:
---------------------------------
Labels: pull-request-available (was: )
> Use the default estimated record size in averageBytesPerRecord() when the
> estimation threshold is less than 0
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-5250
> URL: https://issues.apache.org/jira/browse/HUDI-5250
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: XixiHua
> Priority: Major
> Labels: pull-request-available
>
> Currently, Hudi obtains the average record size from the records written
> during the previous commit. This value is used to estimate how many records
> can be packed into one file; the relevant code is
> UpsertPartitioner.averageBytesPerRecord().
> However, we found that a single data file could grow to 600~700 MB while
> most other files are less than 200 MB.
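> For illustration, here is a minimal sketch of how such an average feeds the
> packing decision; the names are illustrative, not Hudi's actual API:
> {code:java}
> // Hypothetical sketch: how an average record size drives file packing.
> public class RecordPackingSketch {
>
>   // Estimate how many records fit into one data file of maxFileSizeBytes,
>   // given the average bytes per record observed in the previous commit.
>   static long recordsPerFile(long maxFileSizeBytes, long avgBytesPerRecord) {
>     return Math.max(1, maxFileSizeBytes / avgBytesPerRecord);
>   }
>
>   public static void main(String[] args) {
>     // E.g. the last commit wrote 100 MB across 1,000,000 records:
>     long avg = (100L * 1024 * 1024) / 1_000_000; // ~104 bytes/record
>     // With a 120 MB target file size, ~1.2M records go into one file.
>     System.out.println(recordsPerFile(120L * 1024 * 1024, avg));
>   }
> }
> {code}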
> * Reason
> ** The result of totalBytesWritten/totalRecordsWritten from the last commit
> can be very small; if the records written in the next commit are much
> larger, the partitioner packs too many of them into each file and the data
> files become very large.
> * Solution plans
> ** Plan 1: calculate the average size over the past several commits, not
> just the last one. However, getCommitMetadata costs a lot of time, which
> would make this function slow, so we did not choose this.
> ** Plan 2: use the configured estimated record size, since our record size
> is roughly fixed. Because the Hudi community did not encourage adding yet
> another boolean variable to control whether to use the last commit's
> average size, we reuse the estimation threshold: when it is less than 0, we
> fall back to the default estimated record size (see the sketch below).
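> A minimal sketch of the Plan 2 fallback, assuming a threshold-style config
> value; the method name, signature, and parameters are illustrative, not the
> actual UpsertPartitioner code:
> {code:java}
> // Hypothetical sketch of the proposed fallback (illustrative names).
> public class AvgRecordSizeSketch {
>
>   static long averageBytesPerRecord(long totalBytesWritten,
>                                     long totalRecordsWritten,
>                                     double estimationThreshold,
>                                     long defaultRecordSizeEstimate) {
>     // Proposed behavior: a threshold below 0 means "ignore the last
>     // commit's statistics and use the configured default estimate".
>     if (estimationThreshold < 0) {
>       return defaultRecordSizeEstimate;
>     }
>     // Existing behavior: derive the average from the previous commit;
>     // this is the value that can come out misleadingly small.
>     if (totalRecordsWritten > 0) {
>       return (long) Math.ceil((double) totalBytesWritten / totalRecordsWritten);
>     }
>     return defaultRecordSizeEstimate;
>   }
> }
> {code}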
>