[ 
https://issues.apache.org/jira/browse/HUDI-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XixiHua updated HUDI-5250:
--------------------------
    Description: 
Currently, Hudi obtains the average record size from the records written by
previous commits. It is used to estimate how many records to pack into one
file; the relevant code is UpsertPartitioner.averageBytesPerRecord(). A
simplified sketch of the current logic is shown below.
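
For reference, a simplified sketch of the current estimation logic (abridged,
not the exact source; the config accessor names are assumptions based on
HoodieWriteConfig):

{code:java}
// Simplified sketch of UpsertPartitioner.averageBytesPerRecord() (abridged;
// accessor names may differ slightly from the actual source).
static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig config) {
  // Start from the configured default (hoodie.copyonwrite.record.size.estimate).
  long avgSize = config.getCopyOnWriteRecordSizeEstimate();
  long fileSizeThreshold = (long) (config.getRecordSizeEstimationThreshold()
      * config.getParquetSmallFileLimit());
  try {
    // Walk commits from the most recent backwards; the first commit that
    // wrote enough bytes supplies the estimate.
    Iterator<HoodieInstant> instants = commitTimeline.getReverseOrderedInstants().iterator();
    while (instants.hasNext()) {
      HoodieInstant instant = instants.next();
      HoodieCommitMetadata metadata = HoodieCommitMetadata
          .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
      long totalBytesWritten = metadata.fetchTotalBytesWritten();
      long totalRecordsWritten = metadata.fetchTotalRecordsWritten();
      if (totalBytesWritten > fileSizeThreshold && totalRecordsWritten > 0) {
        // This single ratio from one commit is what skews the file sizing
        // described below when record sizes change between commits.
        avgSize = (long) Math.ceil((1.0 * totalBytesWritten) / totalRecordsWritten);
        break;
      }
    }
  } catch (Throwable t) {
    // Fail-safe: fall back to the default estimate.
  }
  return avgSize;
}
{code}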

But we found that a single data file could grow to 600~700 MB while most other
files are less than 200 MB:
 * Reason
 ** the result of totalBytesWritten/totalRecordsWritten from the last commit
can be very small, while the records of the next commit are much larger; the
partitioner then packs far too many records into each file, and the data files
become very large. For example, if the stale estimate is 100 bytes per record
while the incoming records actually average 300 bytes, each new file lands at
roughly three times the intended size.
 * Solution plans
 ** Plan 1: calculate avgSize over the past several commits, not just the last
one. However, getCommitMetadata costs a lot of time, which would make this
function slow, so we did not choose this.
 ** Plan 2: use the configured estimated record size, since our record size is
fixed in some sense. Moreover, the Hudi community did not encourage adding one
more boolean variable to control whether to use the last commit's avgSize, so
we reuse the estimation threshold instead: when it is set to less than 0, the
default estimate is used (see the sketch after this list).
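
A minimal sketch of Plan 2, assuming the existing configs
hoodie.copyonwrite.record.size.estimate (the default estimate) and
hoodie.record.size.estimation.threshold (reused as the switch):

{code:java}
// Proposed change (sketch): a negative estimation threshold short-circuits
// the last-commit lookup, so the configured default estimate is always used.
static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig config) {
  long avgSize = config.getCopyOnWriteRecordSizeEstimate();
  if (config.getRecordSizeEstimationThreshold() < 0) {
    // Trust the user-provided estimate; skip scanning commit metadata entirely.
    return avgSize;
  }
  // ... otherwise keep the existing behaviour: derive avgSize from the most
  // recent sufficiently large commit, as sketched above ...
  return avgSize;
}
{code}

A job whose record size is stable would then set
hoodie.record.size.estimation.threshold to a negative value and
hoodie.copyonwrite.record.size.estimate to the known size, with no new boolean
config added.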

 

  was:
Currently, Hudi obtains the average record size from the records written by
previous commits. It is used to estimate how many records to pack into one
file; the relevant code is UpsertPartitioner.averageBytesPerRecord().

But we found that a single data file could grow to 600~700 MB while most other
files are less than 200 MB:
 * Reason
 ** the result of totalBytesWritten/totalRecordsWritten from the last commit
can be very small, while the records of the next commit are much larger; the
data files then become very large.
 * Solution plans
 ** Plan 1: calculate avgSize over the past several commits, not just the last
one. However, getCommitMetadata costs a lot of time, which would make this
function slow, so we did not choose this.
 ** Plan 2: using the

 


> Use the default estimated record size in averageBytesPerRecord() when the 
> estimation threshold is less than 0
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5250
>                 URL: https://issues.apache.org/jira/browse/HUDI-5250
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: XixiHua
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
