bhasudha commented on a change in pull request #4076:
URL: https://github.com/apache/hudi/pull/4076#discussion_r754882176



##########
File path: website/docs/file_sizing.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Auto File Size Management"
+toc: true
+---
+
+This doc will show you how Apache Hudi overcomes the dreaded small files 
problem. A key design decision in Hudi was to 
+avoid creating small files in the first place and always write properly sized 
files. 
+There are 2 ways to manage small files in Hudi and below will describe the 
advantages and trade-offs of each.
+
+## Auto-Size During ingestion
+
+You can automatically manage size of files during ingestion. This solution 
adds a little latency during ingestion, but
+it ensures that read queries are always efficient as soon as a write is 
committed. If you don't 
+manage file sizing as you write and instead try to periodically run a 
file-sizing clean-up, your queries will be slow until that resize cleanup is 
periodically performed.
+ 
+(Note: [bulk_insert](/docs/next/write_operations) write operation does not 
provide auto-sizing during ingestion)
+
+### For Copy-On-Write 
+This is as simple as configuring the [maximum size for a base/parquet 
file](/docs/configurations#hoodieparquetmaxfilesize) 
+and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below 
which a file should 
+be considered a small file. For the initial bootstrap of a Hudi table, tuning 
record size estimate is also important to 
+ensure sufficient records are bin-packed in a parquet file. For subsequent 
writes, Hudi automatically uses average 
+record size based on previous commit. Hudi will try to add enough records to a 
small file at write time to get it to the 
+configured maximum limit. For example, with `compactionSmallFileSize=100MB` and 
`limitFileSize=120MB`, Hudi will pick all 
+files < 100MB and try to get them up to 120MB.
+
+### For Merge-On-Read 
+MergeOnRead works differently for different INDEX choices, so there are a few 
more configs to set:  
+
+- Indexes with **canIndexLogFiles = true** : Inserts of new data go directly 
to log files. In this case, you can 
+configure the [maximum log size](/docs/configurations#hoodielogfilemaxsize) 
and a 

Review comment:
       (/docs/configurations/#hoodielogfilemaxsize)
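
For context, the Copy-On-Write sizing rule described in the hunk above can be sketched roughly as follows. This is only an illustration: the option keys are inferred from the configuration anchors linked in the doc (`hoodieparquetsmallfilelimit`, `hoodieparquetmaxfilesize`), the values mirror the 100MB/120MB example, and `plan_small_file_fill` is a hypothetical helper, not Hudi's actual bin-packing code.

```python
MB = 1024 * 1024

# Writer options mirroring the doc's example (values are in bytes).
# Keys are assumed from the linked config anchors.
hudi_file_sizing_options = {
    "hoodie.parquet.small.file.limit": str(100 * MB),  # soft limit: below this, a file counts as small
    "hoodie.parquet.max.file.size": str(120 * MB),     # target maximum size for a base file
}


def plan_small_file_fill(file_sizes, small_limit, max_size):
    """Hypothetical helper: for each file strictly below small_limit,
    return how many bytes could still be added before reaching max_size."""
    return {
        index: max_size - size
        for index, size in enumerate(file_sizes)
        if size < small_limit
    }


# Three existing base files of 50MB, 100MB, and 110MB.
plan = plan_small_file_fill(
    [50 * MB, 100 * MB, 110 * MB],
    small_limit=int(hudi_file_sizing_options["hoodie.parquet.small.file.limit"]),
    max_size=int(hudi_file_sizing_options["hoodie.parquet.max.file.size"]),
)
print(plan)
```

Only the 50MB file qualifies here, since the doc's rule selects files strictly below the small-file limit. In practice Hudi converts the remaining headroom into a record count using the average record size, which this sketch leaves out.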





