[GitHub] [hudi] manojpec commented on a change in pull request #4226: [HUDI-2821] - Docs for Metadata Table

GitBox Mon, 06 Dec 2021 09:50:28 -0800


manojpec commented on a change in pull request #4226:
URL: https://github.com/apache/hudi/pull/4226#discussion_r763235355




##########
File path: website/docs/metadata.md
##########
@@ -0,0 +1,34 @@
+---
+title: Metadata Table
+keywords: [ hudi, metadata, S3 file listings]
+---
+
+## Motivation for a Metadata Table
+
+The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The two main purposes of the
+Metadata Table are:
+
+1. **Eliminate the requirement for the "list files" operation:**
+   1. When reading, writing data in HDFS, file listing operations are performed to get the current view of the file system.
+      When data sets are large, listing all the files becomes a performance bottleneck and in the case of cloud storage systems
+      like AWS S3, sometimes causes throttling due to list operation request limits. The Metadata Table will instead
+      proactively maintain the list of files and remove the need for recursive file listing operations on HDFS.
+2. **Create Column Indexes for better query planning and faster lookups by readers**

Review comment:
       nit: By the way, this feature is not yet in 0.10.0. Should we say that the metadata table is a foundational block for building future performance-oriented features, such as this column index?

##########
File path: website/docs/metadata.md
##########
@@ -0,0 +1,34 @@
+---
+title: Metadata Table
+keywords: [ hudi, metadata, S3 file listings]
+---
+
+## Motivation for a Metadata Table
+
+The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The two main purposes of the
+Metadata Table are:
+
+1. **Eliminate the requirement for the "list files" operation:**
+   1. When reading, writing data in HDFS, file listing operations are performed to get the current view of the file system.
+      When data sets are large, listing all the files becomes a performance bottleneck and in the case of cloud storage systems
+      like AWS S3, sometimes causes throttling due to list operation request limits. The Metadata Table will instead
+      proactively maintain the list of files and remove the need for recursive file listing operations on HDFS.
+2. **Create Column Indexes for better query planning and faster lookups by readers**

Review comment:
       nit: indexes vs indices

##########
File path: website/docs/metadata.md
##########
@@ -0,0 +1,34 @@
+---
+title: Metadata Table
+keywords: [ hudi, metadata, S3 file listings]
+---
+
+## Motivation for a Metadata Table
+
+The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The two main purposes of the
+Metadata Table are:
+
+1. **Eliminate the requirement for the "list files" operation:**
+   1. When reading, writing data in HDFS, file listing operations are performed to get the current view of the file system.
+      When data sets are large, listing all the files becomes a performance bottleneck and in the case of cloud storage systems
+      like AWS S3, sometimes causes throttling due to list operation request limits. The Metadata Table will instead
+      proactively maintain the list of files and remove the need for recursive file listing operations on HDFS.

Review comment:
       This applies to any backing storage, not just HDFS.



