[ 
https://issues.apache.org/jira/browse/HUDI-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XixiHua updated HUDI-5250:
--------------------------
    Description: 
Currently, Hudi estimates the average record size from the records written in 
previous commits. The estimate is used to decide how many records to pack into 
one file; the relevant code is UpsertPartitioner.averageBytesPerRecord().

But we found that a single data file could grow to 600~700M while most other 
files are less than 200M:
 *  Reason

 ** totalBytesWritten/totalRecordsWritten is very small for the last commit, 
but the records in the next commit are very large, so too many records are 
packed into each file and the data files become very large. 
 * Solution plans
 ** Plan1: calculate avgSize over the past several commits instead of only the 
last one. However, getCommitMetadata is expensive, so this function might be 
slow, and we did not choose this.
 ** Plan2: using the 
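
To make the skew described under "Reason" concrete, here is a minimal sketch of 
the arithmetic. It is not Hudi's actual code: the class name, the method names, 
the 120 MB target file size, and all byte/record counts are assumptions chosen 
for illustration. It only shows how a small totalBytesWritten/totalRecordsWritten 
ratio from the last commit inflates the next file, and how a Plan1-style pooled 
average over several commits would change the estimate.

```java
// Illustrative only: hypothetical names and numbers, not Hudi's implementation.
final class RecordSizeEstimateSketch {

    // Average bytes per record from one commit's write statistics.
    static long avgFromLastCommit(long totalBytesWritten, long totalRecordsWritten) {
        if (totalRecordsWritten <= 0) {
            throw new IllegalArgumentException("no records written");
        }
        return totalBytesWritten / totalRecordsWritten;
    }

    // Plan1-style estimate: pool bytes and records over several commits.
    // Each row is {totalBytesWritten, totalRecordsWritten}.
    static long avgFromCommits(long[][] commits) {
        long bytes = 0;
        long records = 0;
        for (long[] commit : commits) {
            bytes += commit[0];
            records += commit[1];
        }
        if (records <= 0) {
            throw new IllegalArgumentException("no records written");
        }
        return bytes / records;
    }

    // How many records the partitioner would plan to pack into one file.
    static long recordsPerFile(long targetFileBytes, long avgBytesPerRecord) {
        return targetFileBytes / avgBytesPerRecord;
    }

    public static void main(String[] args) {
        long targetFileBytes = 120L * 1024 * 1024; // assumed target file size

        // Last commit wrote small records: 200 bytes each on average.
        long estimate = avgFromLastCommit(200_000_000L, 1_000_000L);
        long planned = recordsPerFile(targetFileBytes, estimate);

        // The next commit's records are actually about 1000 bytes each,
        // so the file ends up roughly five times larger than planned.
        long actualFileBytes = planned * 1_000L;
        System.out.println("estimated bytes/record: " + estimate);
        System.out.println("planned records/file:   " + planned);
        System.out.println("actual file size (MB):  " + actualFileBytes / (1024 * 1024));

        // Pooling the small commit with an earlier large one raises the estimate.
        long pooled = avgFromCommits(new long[][] {
            {200_000_000L, 1_000_000L},
            {1_000_000_000L, 1_000_000L},
        });
        System.out.println("pooled bytes/record:    " + pooled);
    }
}
```

With these assumed numbers the single-commit estimate plans about five times 
too many records per file, which matches the 600~700M outliers reported above. 
The pooled estimate is less sensitive to one unusual commit, at the cost of 
reading more commit metadata.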

 

> Using the default value of estimate record size at the 
> averageBytesPerRecord() when estimation threshold is less than 0
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5250
>                 URL: https://issues.apache.org/jira/browse/HUDI-5250
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: XixiHua
>            Priority: Major



