[ 
https://issues.apache.org/jira/browse/HUDI-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weiming updated HUDI-5315:
--------------------------
    Description: 
Although Hudi can estimate the record size dynamically, this only takes effect 
under certain conditions. On the first commit, the default 
[hoodie.copyonwrite.record.size.estimate = 1024] is used, so if the first 
commit writes a large amount of data and the user has not set 
[hoodie.copyonwrite.record.size.estimate] to a reasonable value, many small 
files are produced.
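The small-file effect can be seen with some illustrative arithmetic (the two config keys are real Hudi settings; the numeric values below are hypothetical, not from this issue): Hudi plans records per file as target file size divided by the estimated record size, so overestimating the record size shrinks every file.

```python
# Illustrative arithmetic, not Hudi internals: an overestimated record size
# yields undersized files. All numbers here are hypothetical.
TARGET_FILE_SIZE = 120 * 1024 * 1024   # e.g. hoodie.parquet.max.file.size (bytes)
estimated_record_size = 1024           # default hoodie.copyonwrite.record.size.estimate
actual_record_size = 100               # true average size, unknown on first commit

# Hudi-style planning: how many records fit a file under the estimate.
records_per_file = TARGET_FILE_SIZE // estimated_record_size
# What actually lands on disk when records are smaller than estimated.
actual_file_size = records_per_file * actual_record_size
print(records_per_file, actual_file_size)  # files end up ~12 MB instead of 120 MB
```

The gap between `actual_file_size` and `TARGET_FILE_SIZE` is exactly the ratio of actual to estimated record size, which is why a one-off bad estimate on the first commit fans out into many small files.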

 

We wrote about 100 GB of data on the first commit.

With the default [hoodie.copyonwrite.record.size.estimate = 1024], more than 
1500 small files were generated:

!image2022-11-24_15-42-49.png|width=723,height=209!

 

After the change, we set the following parameters:
set hoodie.copyonwrite.record.dynamic.sample.maxnum = 100;
set hoodie.copyonwrite.record.dynamic.sample.ratio = 0.1;
This produced 97 files of normal size:

!image2022-11-24_15-44-16.png|width=762,height=206!
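A minimal sketch of the sampling idea behind the two proposed parameters (an assumption about their intent, not Hudi's actual implementation): take roughly a `ratio` fraction of the incoming records, cap the sample at `max_num`, and average their serialized size instead of trusting the static estimate.

```python
import json

# Sketch only: mirrors the presumed intent of the proposed
# hoodie.copyonwrite.record.dynamic.sample.{ratio,maxnum} parameters.
def sample_record_size(records, ratio=0.1, max_num=100, default=1024):
    """Average serialized size (bytes) over a bounded sample of records."""
    step = max(1, round(1 / ratio))
    sample = records[::step][:max_num]     # ~ratio of records, capped at max_num
    if not sample:
        return default                     # fall back to the static estimate
    total = sum(len(json.dumps(r).encode("utf-8")) for r in sample)
    return total // len(sample)

rows = [{"id": i, "payload": "x" * 200} for i in range(10_000)]
print(sample_record_size(rows))  # only 100 records are actually measured
```

Capping the sample keeps the estimation cost bounded even for a very large first commit, while the ratio keeps the sample representative for small ones.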

 

  was: Although Hudi can estimate the record size dynamically, this only takes 
effect under certain conditions. On the first commit, the default 
[hoodie.copyonwrite.record.size.estimate = 1024] is used, so if the first 
commit writes a large amount of data and the user has not set 
[hoodie.copyonwrite.record.size.estimate] to a reasonable value, many small 
files are produced.


> The record size is dynamically estimated when the table is first written
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5315
>                 URL: https://issues.apache.org/jira/browse/HUDI-5315
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark-sql, writer-core
>            Reporter: weiming
>            Assignee: weiming
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.2
>
>         Attachments: image2022-11-24_15-42-49.png, 
> image2022-11-24_15-44-16.png
>
>
> Although Hudi can estimate the record size dynamically, this only takes 
> effect under certain conditions. On the first commit, the default 
> [hoodie.copyonwrite.record.size.estimate = 1024] is used, so if the first 
> commit writes a large amount of data and the user has not set 
> [hoodie.copyonwrite.record.size.estimate] to a reasonable value, many small 
> files are produced.
>  
> We wrote about 100 GB of data on the first commit.
> With the default [hoodie.copyonwrite.record.size.estimate = 1024], more than 
> 1500 small files were generated:
> !image2022-11-24_15-42-49.png|width=723,height=209!
>  
> After the change, we set the following parameters:
> set hoodie.copyonwrite.record.dynamic.sample.maxnum = 100;
> set hoodie.copyonwrite.record.dynamic.sample.ratio = 0.1;
> This produced 97 files of normal size:
> !image2022-11-24_15-44-16.png|width=762,height=206!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)