n3nash commented on a change in pull request #2612:
URL: https://github.com/apache/hudi/pull/2612#discussion_r585245488
##########
File path: docs/_posts/2021-03-01-hudi-file-sizing.md
##########
@@ -0,0 +1,96 @@
+---
+title: "Apache Hudi File Sizing"
+excerpt: "How Apache Hudi maintains optimally sized files to meet read SLAs"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi is a data platform technology that provides several functionalities needed to build and manage a data lake.
+One of the key features that Hudi provides is self-managing file sizing, so that users don’t need to worry about small files in their dataset.
+Having a lot of small files makes it harder to meet your SLAs for read queries.
+But streaming data lake use-cases inherently ingest smaller volumes of writes, which can result in a lot of small files if no special management is done.
+
+# Apache Hudi file size management
+
+Hudi avoids such small files and always writes properly sized files, taking a slight hit on ingestion but guaranteeing
Review comment:
Instead of saying a `slight hit on ingestion`, maybe reword it. This block can be reworded as:
Hudi provides ways to write properly sized files during ingestion to guarantee SLAs for your read queries. Common approaches that write very small files and later stitch them together solve the system scalability issues posed by small files, but can violate query SLAs by exposing those small files to queries. One can leverage Hudi's clustering feature instead, but if you want (a) self-managed file sizing and (b) to avoid exposing small files to queries, the automatic file sizing feature is a savior.
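For the blog, it may help to name the concrete knobs: Hudi steers ingestion-time file sizing through write configs such as `hoodie.parquet.small.file.limit` (files below this byte threshold are treated as "small" and receive new inserts) and `hoodie.parquet.max.file.size` (the target ceiling for a data file). The snippet below is only a toy, self-contained sketch of that bin-packing idea, not Hudi's actual implementation; the function name `assign_inserts` and the example byte values are assumptions for illustration.

```python
# Toy sketch of Hudi-style small-file handling (NOT Hudi's real code):
# new records are greedily routed into existing files that are below the
# small-file limit, growing them toward the max file size; any overflow
# spills into fresh, properly sized files.

SMALL_FILE_LIMIT = 100 * 1024 * 1024   # akin to hoodie.parquet.small.file.limit
MAX_FILE_SIZE = 120 * 1024 * 1024      # akin to hoodie.parquet.max.file.size

def assign_inserts(existing_file_sizes, record_size, num_records):
    """Return a packing plan: (existing file index or None for a new file,
    number of records assigned to it)."""
    remaining = num_records
    plan = []
    for i, size in enumerate(existing_file_sizes):
        if remaining == 0:
            break
        if size < SMALL_FILE_LIMIT:
            # Capacity left before this small file reaches the max size.
            capacity = max(0, (MAX_FILE_SIZE - size) // record_size)
            take = min(capacity, remaining)
            if take:
                plan.append((i, take))
                remaining -= take
    while remaining > 0:
        # Spill leftover records into new files, each sized up to the max.
        take = min(MAX_FILE_SIZE // record_size, remaining)
        plan.append((None, take))
        remaining -= take
    return plan
```

With a 30 MB small file, a 110 MB file (already above the limit), 1 MB records, and 200 new records, the sketch tops up the small file with 90 records and writes the remaining 110 into one new file, mirroring the "bin-pack into small files first" behavior the post describes.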
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]