[ 
https://issues.apache.org/jira/browse/HUDI-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weiming updated HUDI-5315:
--------------------------
    Status: In Progress  (was: Open)

> The record size is dynamically estimated when the table is first written
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5315
>                 URL: https://issues.apache.org/jira/browse/HUDI-5315
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark-sql, writer-core
>            Reporter: weiming
>            Assignee: weiming
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: image2022-11-24_15-42-49.png, 
> image2022-11-24_15-44-16.png
>
>
> Hudi can dynamically estimate the average record size, but the estimate only
> takes effect once certain conditions are met. On the first commit, the
> default [hoodie.copyonwrite.record.size.estimate = 1024] is used. If the
> first commit writes a large amount of data and the user has not set
> [hoodie.copyonwrite.record.size.estimate] to a reasonable value, a large
> number of small files will be produced.
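To see why a misconfigured estimate produces small files, here is a hedged back-of-the-envelope sketch. The sizing formula, the 120 MB target file size, and the 64-byte actual record size are illustrative assumptions, not Hudi's actual planner logic:

```python
# Simplified model: the writer packs roughly
# max_file_size / estimated_record_size records into each file.
# Real Hudi file sizing is more involved; this only shows the trend.

def files_for_commit(total_bytes, actual_record_size, estimated_record_size,
                     max_file_size=120 * 1024 * 1024):
    num_records = total_bytes // actual_record_size
    records_per_file = max_file_size // estimated_record_size
    num_files = -(-num_records // records_per_file)  # ceiling division
    actual_file_size = records_per_file * actual_record_size
    return num_files, actual_file_size

# 100 GB of (hypothetically) 64-byte records, default 1024-byte estimate:
# the writer plans far too few records per file, so each file comes out
# roughly 16x smaller than the target, and the file count explodes.
overestimated = files_for_commit(100 * 2**30, 64, 1024)
accurate = files_for_commit(100 * 2**30, 64, 64)
```

With the overestimate, each file holds only `records_per_file * actual_record_size` bytes, far below the target; with an accurate estimate the files fill up to the configured size and the file count drops accordingly.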
>  
> We wrote about 100 GB of data on the first commit:
> With the default [hoodie.copyonwrite.record.size.estimate = 1024], more
> than 1500 small files were generated in total.
> !image2022-11-24_15-42-49.png|width=723,height=209!
>  
> After the modification, we set the following parameters:
> set hoodie.copyonwrite.record.dynamic.sample.maxnum = 100;
> set hoodie.copyonwrite.record.dynamic.sample.ratio = 0.1;
> A total of 97 normal-sized files were generated.
> !image2022-11-24_15-44-16.png|width=762,height=206!
>  
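The sampling-based estimate proposed above could look roughly like the following sketch. The two parameter names come from this issue; the function itself is a hypothetical illustration, not Hudi code:

```python
import random

def estimate_record_size(records, sample_ratio=0.1, sample_maxnum=100,
                         serialized_size=len, default_estimate=1024):
    """Estimate the average record size by sampling the first commit's data.

    sample_ratio  ~ hoodie.copyonwrite.record.dynamic.sample.ratio
    sample_maxnum ~ hoodie.copyonwrite.record.dynamic.sample.maxnum
    Falls back to the static default when there is nothing to sample.
    """
    if not records:
        return default_estimate
    # Sample at the configured ratio, capped at sample_maxnum records.
    sample_size = min(sample_maxnum, max(1, int(len(records) * sample_ratio)))
    sample = random.sample(records, sample_size)
    return sum(serialized_size(r) for r in sample) // sample_size
```

Capping the sample at a fixed maximum keeps the estimation cost bounded even for very large first commits, while the ratio keeps small inputs from being over- or under-sampled.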



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
