This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8373c8b [HUDI-2821] - Docs for Metadata Table - added reference to
vc's benchmark study (#4260)
8373c8b is described below
commit 8373c8b1dc16f759bc5ed902daa1c9f67c41a6de
Author: Kyle Weller <[email protected]>
AuthorDate: Thu Dec 9 10:09:36 2021 -0800
[HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark
study (#4260)
* added reference to vc's benchmark study
* moved metadata to concepts
---
website/docs/metadata.md | 25 +++++++++++++++++--------
website/sidebars.js | 2 +-
2 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index 13cf669..ec4b2ee 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -5,14 +5,23 @@ keywords: [ hudi, metadata, S3 file listings]
## Motivation for a Metadata Table
-The Apache Hudi Metadata Table can significantly improve read/write
performance of your queries. The main purpose of the
-Metadata Table is:
-
-1. **Eliminate the requirement for the "list files" operation:**
- 1. When reading and writing data, file listing operations are performed to
get the current view of the file system.
- When data sets are large, listing all the files becomes a performance
bottleneck and in the case of cloud storage systems
- like AWS S3, sometimes causes throttling due to list operation request
limits. The Metadata Table will instead
- proactively maintain the list of files and remove the need for recursive
file listing operations.
+The Apache Hudi Metadata Table can significantly improve read/write
performance of your queries. The main purpose of the
+Metadata Table is to eliminate the requirement for the "list files" operation.
+
+When reading and writing data, file listing operations are performed to get
the current view of the file system.
+When data sets are large, listing all the files may be a performance
bottleneck, but more importantly, in the case of cloud storage systems
+like AWS S3, the large number of file listing requests sometimes causes
throttling due to certain request limits.
+The Metadata Table will instead proactively maintain the list of files and
remove the need for recursive file listing operations.
+
+### Some numbers from a study:
+In a TPC-DS benchmark run, the p50 list latency for a single folder scales
~linearly with the number of files/objects:
+
+|Number of files/objects|100|1K|10K|100K|
+|---|---|---|---|---|
+|P50 list latency|50ms|131ms|1062ms|9932ms|
+
+In contrast, listings from the Metadata Table do not scale linearly with
file/object count and instead take about 100-500ms per read even for very large
tables.
+Even better, the timeline server caches portions of the metadata (currently
only for writers) and serves listings in ~10ms.
## Enable Hudi Metadata Table
The Hudi Metadata Table is not enabled by default. If you wish to turn it on
you need to enable the following configuration:
diff --git a/website/sidebars.js b/website/sidebars.js
index 23b9227..adc3cfb 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -26,6 +26,7 @@ module.exports = {
'table_types',
'indexing',
'file_layouts',
+ 'metadata',
'write_operations',
'schema_evolution',
'key_generation',
@@ -54,7 +55,6 @@ module.exports = {
'transforms',
'markers',
'file_sizing',
- 'metadata',
'snapshot_exporter',
'precommit_validator'
],