danny0405 commented on code in PR #8093:
URL: https://github.com/apache/hudi/pull/8093#discussion_r1127359729


##########
website/docs/file_sizing.md:
##########
@@ -3,51 +3,76 @@ title: "File Sizing"
 toc: true
 ---
 
-This doc will show you how Apache Hudi overcomes the dreaded small files 
problem. A key design decision in Hudi was to 
-avoid creating small files in the first place and always write properly sized 
files. 
-There are 2 ways to manage small files in Hudi and below will describe the 
advantages and trade-offs of each.
-
-## Auto-Size During ingestion
-
-You can automatically manage size of files during ingestion. This solution 
adds a little latency during ingestion, but
-it ensures that read queries are always efficient as soon as a write is 
committed. If you don't 
-manage file sizing as you write and instead try to periodically run a 
file-sizing clean-up, your queries will be slow until that resize cleanup is 
periodically performed.
- 
-(Note: [bulk_insert](/docs/next/write_operations) write operation does not 
provide auto-sizing during ingestion)
-
-### For Copy-On-Write 
-This is as simple as configuring the [maximum size for a base/parquet 
file](/docs/configurations#hoodieparquetmaxfilesize) 
-and the [soft limit](/docs/configurations#hoodieparquetsmallfilelimit) below 
which a file should 
-be considered a small file. For the initial bootstrap of a Hudi table, tuning 
record size estimate is also important to 
-ensure sufficient records are bin-packed in a parquet file. For subsequent 
writes, Hudi automatically uses average 
-record size based on previous commit. Hudi will try to add enough records to a 
small file at write time to get it to the 
-configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and 
limitFileSize=120MB, Hudi will pick all 
-files < 100MB and try to get them upto 120MB.
-
-### For Merge-On-Read 
-MergeOnRead works differently for different INDEX choices so there are few 
more configs to set:  
-
-- Indexes with **canIndexLogFiles = true** : Inserts of new data go directly 
to log files. In this case, you can 
-configure the [maximum log size](/docs/configurations#hoodielogfilemaxsize) 
and a 
-[factor](/docs/configurations#hoodielogfiletoparquetcompressionratio) that 
denotes reduction in 
-size when data moves from avro to parquet files.
-- Indexes with **canIndexLogFiles = false** : Inserts of new data go only to 
parquet files. In this case, the 
-same configurations as above for the COPY_ON_WRITE case applies.
-
-NOTE : In either case, small files will be auto sized only if there is no 
PENDING compaction or associated log file for 
-that particular file slice. For example, for case 1: If you had a log file and 
a compaction C1 was scheduled to convert 
-that log file to parquet, no more inserts can go into that log file. For case 
2: If you had a parquet file and an update 
-ended up creating an associated delta log file, no more inserts can go into 
that parquet file. Only after the compaction 
-has been performed and there are NO log files associated with the base parquet 
file, can new inserts be sent to auto size that parquet file.
-
-## Auto-Size With Clustering
-**[Clustering](/docs/next/clustering)** is a feature in Hudi to group 
-small files into larger ones either synchronously or asynchronously. Since 
first solution of auto-sizing small files has 
-a tradeoff on ingestion speed (since the small files are sized during 
ingestion), if your use-case is very sensitive to 
-ingestion latency where you don't want to compromise on ingestion speed which 
may end up creating a lot of small files, 
-clustering comes to the rescue. Clustering can be scheduled through the 
ingestion job and an asynchronus job can stitch 
-small files together in the background to generate larger files. NOTE that 
during this, ingestion can continue to run concurrently.
-
-*Please note that Hudi always creates immutable files on disk. To be able to 
do auto-sizing or clustering, Hudi will 
-always create a newer version of the smaller file, resulting in 2 versions of 
the same file. 
-The [cleaner service](/docs/next/hoodie_cleaner) will later kick in and delete 
the older version small file and keep the latest one.*
\ No newline at end of file
+A fundamental problem when writing data to storage is producing a lot of small files, commonly known as the small file problem. If you don't size files appropriately, query performance slows down and your analytics grow stale. Some of the issues you may encounter with small files include the following:
+- **Reads slow down**: Queries have to scan through many small files to retrieve their data, which is a very inefficient way of accessing and utilizing it.
+
+- **Processes slow down**: Engines such as Spark or Hive create a task per file, so the more files you have, the more read tasks you create and the slower your jobs run.
+
+- **Storage increases**: With a lot of data, small files use storage inefficiently. For example, many small files can have a lower compression ratio, leading to more data on storage, and if you're indexing the data, the indexes take up additional space inside the Parquet files. With a small amount of data you might not see a significant impact, but when dealing with petabytes or exabytes of data, you'll need to manage your storage resources efficiently.
+
+All these challenges inevitably lead to stale analytics and scalability problems:
+- Query performance slows down.
+- Jobs run slower than they could.
+- You consume far more resources than necessary.
+
+A critical design decision in the Hudi architecture is to avoid small file 
creation. Hudi is uniquely designed to write appropriately sized files 
automatically. This document will show you how Apache Hudi overcomes the 
dreaded small files problem. There are two ways to manage small files in Hudi: 
+
+- Auto-size during ingestion
+- Clustering
+
+Below, we will describe the advantages and trade-offs of each.
+
+## Auto-sizing during ingestion
+
+You can manage file sizes through Hudi's auto-sizing capability during ingestion. The default targeted size for Parquet base files is 120MB, which can be configured with `hoodie.parquet.max.file.size`. Auto-sizing may add some write latency, but it ensures that read queries are always efficient as soon as a write transaction is committed. It's important to note that if you don't manage file sizing as you write and instead try to run clustering periodically to fix your file sizing, your queries might be slow until the clustering finishes.
+
+**Note**: the `bulk_insert` write operation does not provide auto-sizing during ingestion.
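As a rough sketch of how the target size could be supplied to a writer, the options below pass `hoodie.parquet.max.file.size` as a write config. Only the option key comes from the text above; the table name and the commented write call are illustrative assumptions:

```python
# Hypothetical writer options; only hoodie.parquet.max.file.size is from the doc.
hudi_options = {
    "hoodie.table.name": "trips",  # illustrative table name
    # 120MB target for Parquet base files, expressed in bytes
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
}

# With PySpark, such options are typically passed along the lines of:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```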
+
+If you need to customize the file sizing, e.g., increase the target file size or change how small files are identified, follow the instructions below for Copy-On-Write and Merge-On-Read tables.
+
+### Copy-On-Write (COW)
+To tune the file sizing for a COW table, you can set the small file limit and 
the maximum Parquet file size. Hudi will try to add enough records to a small 
file at write time to get it to the configured maximum limit.
+
+ - For example, if the `hoodie.parquet.small.file.limit=104857600` (100MB) and 
`hoodie.parquet.max.file.size=125829120` (120MB), Hudi will pick all files < 
100MB and try to get them up to 120MB.
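The selection rule in that example can be sketched in plain Python. This is only an illustration of the behavior described above, not Hudi's actual implementation:

```python
MB = 1024 * 1024

def pick_small_files(file_sizes, small_file_limit, max_file_size):
    """Return {file: bytes of headroom} for files under the small-file limit."""
    return {
        name: max_file_size - size
        for name, size in file_sizes.items()
        if size < small_file_limit
    }

sizes = {"f1.parquet": 40 * MB, "f2.parquet": 90 * MB, "f3.parquet": 110 * MB}
# With the 100MB limit and 120MB max from the example, f1 and f2 qualify
# as small files, and each can absorb records up to the 120MB target.
plan = pick_small_files(sizes, small_file_limit=100 * MB, max_file_size=120 * MB)
```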
+
+When creating a Hudi table initially, an accurate record size estimate is vital so Hudi can adequately estimate how many records need to be bin-packed into a Parquet file for the first ingestion batch. For subsequent writes, Hudi automatically uses the average record size based on previous commits.
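As a back-of-the-envelope illustration of why the record size estimate matters for the first batch, the bin-packing target is roughly the maximum file size divided by the estimated record size. The helper below is ours for illustration, not a Hudi API:

```python
def estimated_records_per_file(max_file_size_bytes, record_size_bytes):
    # Roughly how many records fit in one base file of the target size.
    return max_file_size_bytes // record_size_bytes

# A 120MB target file size with a 1KB-per-record estimate
n = estimated_records_per_file(120 * 1024 * 1024, 1024)
```

An estimate that is far off in either direction would cause the first batch to produce files well above or below the target until Hudi has commit history to compute the real average from.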
+
+Here are the important configurations of interest:
+
+### Merge-On-Read (MOR)
+Because a MOR table aims to reduce write amplification compared to a COW table, when writing to a MOR table, Hudi limits auto file sizing during insert and upsert operations to a single Parquet base file per write, which limits the number of rewritten files. This can be configured through `hoodie.merge.small.file.group.candidates.limit`.
+
+In addition to sizing the Parquet base files of a MOR table, you can also tune the log file sizing with `hoodie.logfile.max.size`.
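A hedged sketch of the MOR-specific writer options named above; the option keys come from the text, while the values shown here are illustrative:

```python
mor_options = {
    # Cap how many small base file groups a single write will append to.
    "hoodie.merge.small.file.group.candidates.limit": "1",
    # Maximum size of a log file before rolling over to a new one (1GB here).
    "hoodie.logfile.max.size": str(1024 * 1024 * 1024),
}
```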
+
+**NOTE**: Small files in file groups that are included in a requested or inflight compaction or clustering on the active timeline, or small files with associated log files, are not auto-sized with incoming inserts until the compaction or clustering completes. For example: 
+

Review Comment:
   The NOTE is tenable only for BloomFilter index type. For Flink state index, 
the restriction is gone.


