[ 
https://issues.apache.org/jira/browse/HUDI-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5250:
---------------------------------
    Labels: pull-request-available  (was: )

> Use the default estimated record size in 
> averageBytesPerRecord() when the estimation threshold is less than 0
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5250
>                 URL: https://issues.apache.org/jira/browse/HUDI-5250
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: XixiHua
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, Hudi obtains the average record size from the records written 
> during the previous commit and uses it to estimate how many records to pack 
> into one file; the relevant code is UpsertPartitioner.averageBytesPerRecord().
> However, we found that a single data file could grow to 600~700 MB while 
> most other files were less than 200 MB.
>  * Reason
>  ** The result of totalBytesWritten/totalRecordsWritten from the last commit 
> can be very small; if the records in the next commit are much larger, the 
> resulting data files become very large.
>  * Solution plans
>  ** Plan 1: calculate the avgSize over the past several commits rather than 
> just the last one. However, getCommitMetadata is expensive, so this function 
> could become slow; we did not choose this.
>  ** Plan 2: use an estimated record size, since our data size is roughly 
> fixed. Moreover, the Hudi community did not encourage adding another boolean 
> flag to control whether to use the last commit's avgSize, so we reuse the 
> estimation threshold: when it is less than 0, we fall back to the default 
> estimated record size.
>  
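A minimal sketch of the Plan 2 fallback described above. The class, method signature, and default value below are illustrative assumptions for this issue's proposal, not Hudi's actual UpsertPartitioner code; the real method reads the stats from commit metadata rather than taking them as parameters.

```java
// Hypothetical sketch of the proposed fallback in averageBytesPerRecord().
// Names and the estimationThreshold parameter are illustrative, not Hudi's API.
public class AvgRecordSizeEstimator {

    // Assumed default estimate (bytes per record) used when no reliable
    // history-based average should be trusted.
    static final long DEFAULT_RECORD_SIZE_ESTIMATE = 1024L;

    /**
     * Returns the average bytes per record based on the last commit's stats,
     * unless the estimation threshold is negative (the proposal in this
     * issue), in which case the default estimate is returned.
     */
    static long averageBytesPerRecord(long totalBytesWritten,
                                      long totalRecordsWritten,
                                      double estimationThreshold) {
        // Proposed behavior: a threshold < 0 forces the default estimate,
        // avoiding oversized files when the last commit's average is misleading.
        if (estimationThreshold < 0 || totalRecordsWritten <= 0) {
            return DEFAULT_RECORD_SIZE_ESTIMATE;
        }
        // Guard against a zero average from integer division.
        return Math.max(totalBytesWritten / totalRecordsWritten, 1L);
    }
}
```

With a threshold >= 0 the last commit's average (totalBytesWritten / totalRecordsWritten) is used as before; with a negative threshold the caller always gets the fixed default, so file sizing no longer depends on a possibly unrepresentative previous commit.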



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
