[
https://issues.apache.org/jira/browse/HUDI-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu updated HUDI-5315:
-----------------------------
Reviewers: Raymond Xu
> The record size is dynamically estimated when the table is first written
> ------------------------------------------------------------------------
>
> Key: HUDI-5315
> URL: https://issues.apache.org/jira/browse/HUDI-5315
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql, writer-core
> Reporter: weiming
> Assignee: weiming
> Priority: Blocker
> Labels: pull-request-available
> Attachments: image2022-11-24_15-42-49.png,
> image2022-11-24_15-44-16.png
>
>
> Although Hudi can dynamically estimate the average record size, the estimate
> only takes effect once certain conditions are met. On a table's first commit,
> the default [hoodie.copyonwrite.record.size.estimate = 1024] is used, so if
> the first commit writes a large amount of data and the user has not set
> [hoodie.copyonwrite.record.size.estimate] to a reasonable value, a large
> number of small files is produced.
>
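> To illustrate the failure mode with assumed numbers: Hudi sizes new base
> files by dividing the target file size (hoodie.parquet.max.file.size,
> 120 MB by default) by the estimated record size. The 64-byte actual record
> size below is a made-up assumption; when real records are smaller than the
> 1024-byte default estimate, every file comes out undersized:

```python
# Sketch with hypothetical numbers: an inflated per-record size estimate
# caps the record count per file too low, yielding small files.
TARGET_FILE_SIZE = 120 * 1024 * 1024   # hoodie.parquet.max.file.size default
ESTIMATED_RECORD_SIZE = 1024           # hoodie.copyonwrite.record.size.estimate default
ACTUAL_RECORD_SIZE = 64                # assumed true on-disk record size

# Records the writer packs into one base file, per the estimate.
records_per_file = TARGET_FILE_SIZE // ESTIMATED_RECORD_SIZE

# Size the file actually reaches with the true record size.
actual_file_size = records_per_file * ACTUAL_RECORD_SIZE

print(records_per_file)               # 122880 records per file
print(actual_file_size / (1024 ** 2)) # 7.5 MB files instead of ~120 MB
```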
> For an initial write of roughly 100 GB with the default
> [hoodie.copyonwrite.record.size.estimate = 1024], more than 1500 small files
> were generated:
> !image2022-11-24_15-42-49.png|width=723,height=209!
>
> After our modification, with the sampling parameters set to
> set hoodie.copyonwrite.record.dynamic.sample.maxnum = 100;
> set hoodie.copyonwrite.record.dynamic.sample.ratio = 0.1;
> a total of 97 normally sized files was generated:
> !image2022-11-24_15-44-16.png|width=762,height=206!
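>
> The sampling approach above can be sketched as follows. This is a minimal
> illustration, not the actual PR implementation: the function and the
> serialization stand-in are hypothetical, and only the parameter names mirror
> the proposed hoodie.copyonwrite.record.dynamic.sample.{maxnum,ratio}
> settings.

```python
import random

def serialize(record):
    # Stand-in for the writer's real serialization (e.g. Avro encoding).
    return repr(record).encode("utf-8")

def estimate_record_size(records, sample_ratio=0.1, sample_max=100):
    """Estimate average record size by sampling: take at most `sample_max`
    records (a `sample_ratio` fraction of the input, minimum one), measure
    their serialized sizes, and return the average."""
    n = min(sample_max, max(1, int(len(records) * sample_ratio)))
    sample = random.sample(records, n)
    sizes = [len(serialize(r)) for r in sample]
    return sum(sizes) // len(sizes)
```

> A dynamic estimate like this replaces the static 1024-byte default on the
> first commit, so file sizing no longer depends on the user guessing the
> record size up front.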
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)