This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 330f2aa05d [HUDI-3930][Docs] Adding documentation around Data Skipping
(#5440)
330f2aa05d is described below
commit 330f2aa05d40312d764d5331d4ce92c1315f3914
Author: Alexey Kudinkin <[email protected]>
AuthorDate: Sun May 1 02:54:02 2022 -0700
[HUDI-3930][Docs] Adding documentation around Data Skipping (#5440)
Co-authored-by: Raymond Xu <[email protected]>
---
website/docs/performance.md | 77 +++++++++++++++-------
website/releases/release-0.11.0.md | 2 +
.../versioned_docs/version-0.11.0/performance.md | 77 +++++++++++++++-------
3 files changed, 108 insertions(+), 48 deletions(-)
diff --git a/website/docs/performance.md b/website/docs/performance.md
index db78a7f25b..162b7cc85e 100644
--- a/website/docs/performance.md
+++ b/website/docs/performance.md
@@ -23,12 +23,14 @@ Here are some ways to efficiently manage the storage of
your Hudi tables.
once created cannot be deleted, but simply expanded as explained before.
- For workloads with heavy updates, the [merge-on-read
table](/docs/concepts#merge-on-read-table) provides a nice mechanism for
ingesting quickly into smaller files and then later merging them into larger
base files via compaction.
-## Performance Gains
+## Performance Optimizations
In this section, we go over some real world performance numbers for Hudi
upserts, incremental pull and compare them against
-the conventional alternatives for achieving these tasks.
+the conventional alternatives for achieving these tasks.
-### Upserts
+### Write Path
+
+#### Upserts
Following shows the speed up obtained for NoSQL database ingestion, from
incrementally upserting on a Hudi table on the copy-on-write storage,
on 5 tables ranging from small to huge (as opposed to bulk loading the tables)
@@ -47,38 +49,65 @@ significant savings on the overall compute cost.
Hudi upserts have been stress tested upto 4TB in a single commit across the t1
table.
See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for
some tuning tips.
-### Indexing
+#### Indexing
-In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
-it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
+In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
+it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
the file group id it belongs to. By default, Hudi uses a built in index that
uses file ranges and bloom filters to accomplish this, with
-upto 10x speed up over a spark join to do the same.
+upto 10x speed up over a spark join to do the same.
Hudi provides best indexing performance when you model the recordKey to be
monotonically increasing (e.g timestamp prefix), leading to range pruning
filtering
out a lot of files for comparison. Even for UUID based keys, there are [known
techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)
to achieve this.
-For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
-**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
+For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
+**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
a **80-100% speedup**.
-### Snapshot Queries
-The major design goal for snapshot queries is to achieve the latency reduction
& efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi
tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
-**Hive**
+#### Data Skipping
+
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages file-level metadata to very effectively prune the search space,
+avoiding reads (even of the footers) of files that are known, based on that metadata, to contain only data that _does not match_ the query's filters.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_hive.png").default}
alt="hudi_query_perf_hive.png" />
-</figure>
+Data Skipping relies on the Metadata Table's Column Stats Index, which holds column-level statistics (such as min value, max value, and count of null values)
+for every column of every file in the Hudi table. Instead of enumerating every file in the table and reading its corresponding metadata
+(for example, Parquet footers) to analyze whether it could contain any data matching the query's filters, Hudi simply queries the Column Stats Index
+in the Metadata Table (which is itself a Hudi table) and, within seconds even for TB-scale tables with tens of thousands of files, obtains the list
+of _all the files that might potentially contain data_ matching the query's filters, with the crucial property that files that can be ruled out as not
+containing such data (based on their column-level statistics) are stripped out.
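The pruning test described above can be sketched as a toy example (hypothetical Python, not Hudi's actual implementation): a file is kept as a candidate only when its [min, max] range for the filtered column can overlap the query's predicate range, so files that cannot contain matching rows are never read.

```python
# Toy illustration of column-stats-based file pruning (NOT Hudi's actual code).
# Each file carries (min, max) stats per column; a file is skipped when the
# query's predicate range cannot overlap the file's [min, max] range.

def prune_files(file_stats, column, lo, hi):
    """Return the files that MIGHT contain rows with lo <= column <= hi."""
    candidates = []
    for file_name, stats in file_stats.items():
        col_min, col_max = stats[column]
        # Overlap test: only skip a file whose range lies entirely outside the
        # query range -- this never rules out a file that could match.
        if col_max >= lo and col_min <= hi:
            candidates.append(file_name)
    return candidates

# Hypothetical per-file statistics, as a Column Stats Index would carry them.
file_stats = {
    "f1.parquet": {"ts": (0, 100)},
    "f2.parquet": {"ts": (101, 200)},
    "f3.parquet": {"ts": (201, 300)},
}

# Query filter: WHERE ts BETWEEN 150 AND 250 -> only f2 and f3 can match.
print(prune_files(file_stats, "ts", 150, 250))  # ['f2.parquet', 'f3.parquet']
```

Note the asymmetry: pruning is conservative, so a surviving file may still contain no matching rows, but a skipped file is guaranteed not to.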
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using physical partitioning, where records are laid out on disk
+in a folder structure based on some column's value or a derivative of it (clumping related records together). Instead of an on-disk
+folder structure, however, Data Skipping leverages an index maintaining a mapping "file → column statistics" for all of the columns persisted
+within that file.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_spark.png").default}
alt="hudi_query_perf_spark.png" />
-</figure>
+For very large tables (1TB+, tens of thousands of files), Data Skipping can:
-**Presto**
+1. Substantially improve query execution runtime (by avoiding fruitless compute churn), in excess of **10x** compared to the same query on the same dataset without Data Skipping enabled.
+2. Help avoid hitting cloud storage throttling limits (for issuing too many requests; for example, AWS limits the number of requests per second that can be issued per object prefix, which considerably complicates things for partitioned tables).
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_presto.png").default}
alt="hudi_query_perf_presto.png" />
-</figure>
+If you're interested in more details on how Data Skipping works internally, watch out for a blog post coming on this soon!
+
+To unlock the power of Data Skipping, you will need to:
+
+1. Enable the Metadata Table along with the Column Stats Index on the _write path_ (see [Async Meta Indexing](/docs/async_meta_indexing)).
+2. Enable Data Skipping in your queries.
+
+To enable the Metadata Table along with the Column Stats Index on the write path, make sure
+the following configurations are set to `true`:
+
+ - `hoodie.metadata.enable` (to enable Metadata Table on the write path,
enabled by default)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
being populated on the write path, disabled by default)
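As a minimal sketch (not an official snippet), these write-path options could be assembled and passed to a Spark writer; the DataFrame, base path, and surrounding Spark calls are hypothetical, while the option keys are the ones listed above:

```python
# Hypothetical sketch: write-path options enabling the Metadata Table and the
# Column Stats Index. The Spark write call is shown in a comment because it
# requires a running Spark session.

write_opts = {
    # Enables the Metadata Table on the write path (enabled by default).
    "hoodie.metadata.enable": "true",
    # Populates the Column Stats Index on the write path (disabled by default).
    "hoodie.metadata.index.column.stats.enable": "true",
}

# In a Spark job this would be applied roughly as:
#   df.write.format("hudi").options(**write_opts).mode("append").save(base_path)
```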
+
+:::note
+If you're planning on enabling the Column Stats Index for an already existing table, please check out the [Async Meta Indexing](/docs/async_meta_indexing) guide
+on how to build Metadata Table indices (such as the Column Stats Index) for existing tables.
+:::
+
+
+To enable Data Skipping in your queries, make sure to set the following properties to `true` (on the read path):
+
+ - `hoodie.enable.data.skipping` (to enable Data Skipping)
+ - `hoodie.metadata.enable` (to enable Metadata Table use on the read path)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
use on the read path)
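The read-path counterpart can be sketched the same way (again hypothetical: the Spark read call and the filter are illustrative, while the option keys are the ones listed above):

```python
# Hypothetical sketch: read-path options for Data Skipping. The Spark read
# call is shown in a comment because it requires a running Spark session.

read_opts = {
    "hoodie.enable.data.skipping": "true",                # turn on Data Skipping
    "hoodie.metadata.enable": "true",                     # use the Metadata Table on reads
    "hoodie.metadata.index.column.stats.enable": "true",  # use the Column Stats Index on reads
}

# Roughly:
#   spark.read.format("hudi").options(**read_opts).load(base_path) \
#        .filter("ts BETWEEN 150 AND 250")
```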
diff --git a/website/releases/release-0.11.0.md
b/website/releases/release-0.11.0.md
index 2916b12b86..0662eeddee 100644
--- a/website/releases/release-0.11.0.md
+++ b/website/releases/release-0.11.0.md
@@ -48,6 +48,8 @@ following: `date_format(ts, "MM/dd/yyyy" ) < "04/01/2022"`.
*Note: Currently Data Skipping is only supported in COW tables and MOR tables
in read-optimized mode. The work of full
support for MOR tables is tracked in
[HUDI-3866](https://issues.apache.org/jira/browse/HUDI-3866)*
+*Refer to the [performance](/docs/performance#read-path) guide for more info.*
+
### Async Indexer
In 0.11.0, we added a new asynchronous service for indexing to our rich set of
table services. It allows users to create
diff --git a/website/versioned_docs/version-0.11.0/performance.md
b/website/versioned_docs/version-0.11.0/performance.md
index db78a7f25b..162b7cc85e 100644
--- a/website/versioned_docs/version-0.11.0/performance.md
+++ b/website/versioned_docs/version-0.11.0/performance.md
@@ -23,12 +23,14 @@ Here are some ways to efficiently manage the storage of
your Hudi tables.
once created cannot be deleted, but simply expanded as explained before.
- For workloads with heavy updates, the [merge-on-read
table](/docs/concepts#merge-on-read-table) provides a nice mechanism for
ingesting quickly into smaller files and then later merging them into larger
base files via compaction.
-## Performance Gains
+## Performance Optimizations
In this section, we go over some real world performance numbers for Hudi
upserts, incremental pull and compare them against
-the conventional alternatives for achieving these tasks.
+the conventional alternatives for achieving these tasks.
-### Upserts
+### Write Path
+
+#### Upserts
Following shows the speed up obtained for NoSQL database ingestion, from
incrementally upserting on a Hudi table on the copy-on-write storage,
on 5 tables ranging from small to huge (as opposed to bulk loading the tables)
@@ -47,38 +49,65 @@ significant savings on the overall compute cost.
Hudi upserts have been stress tested upto 4TB in a single commit across the t1
table.
See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for
some tuning tips.
-### Indexing
+#### Indexing
-In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
-it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
+In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
+it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
the file group id it belongs to. By default, Hudi uses a built in index that
uses file ranges and bloom filters to accomplish this, with
-upto 10x speed up over a spark join to do the same.
+upto 10x speed up over a spark join to do the same.
Hudi provides best indexing performance when you model the recordKey to be
monotonically increasing (e.g timestamp prefix), leading to range pruning
filtering
out a lot of files for comparison. Even for UUID based keys, there are [known
techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)
to achieve this.
-For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
-**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
+For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
+**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
a **80-100% speedup**.
-### Snapshot Queries
-The major design goal for snapshot queries is to achieve the latency reduction
& efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi
tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
-**Hive**
+#### Data Skipping
+
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages file-level metadata to very effectively prune the search space,
+avoiding reads (even of the footers) of files that are known, based on that metadata, to contain only data that _does not match_ the query's filters.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_hive.png").default}
alt="hudi_query_perf_hive.png" />
-</figure>
+Data Skipping relies on the Metadata Table's Column Stats Index, which holds column-level statistics (such as min value, max value, and count of null values)
+for every column of every file in the Hudi table. Instead of enumerating every file in the table and reading its corresponding metadata
+(for example, Parquet footers) to analyze whether it could contain any data matching the query's filters, Hudi simply queries the Column Stats Index
+in the Metadata Table (which is itself a Hudi table) and, within seconds even for TB-scale tables with tens of thousands of files, obtains the list
+of _all the files that might potentially contain data_ matching the query's filters, with the crucial property that files that can be ruled out as not
+containing such data (based on their column-level statistics) are stripped out.
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using physical partitioning, where records are laid out on disk
+in a folder structure based on some column's value or a derivative of it (clumping related records together). Instead of an on-disk
+folder structure, however, Data Skipping leverages an index maintaining a mapping "file → column statistics" for all of the columns persisted
+within that file.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_spark.png").default}
alt="hudi_query_perf_spark.png" />
-</figure>
+For very large tables (1TB+, tens of thousands of files), Data Skipping can:
-**Presto**
+1. Substantially improve query execution runtime (by avoiding fruitless compute churn), in excess of **10x** compared to the same query on the same dataset without Data Skipping enabled.
+2. Help avoid hitting cloud storage throttling limits (for issuing too many requests; for example, AWS limits the number of requests per second that can be issued per object prefix, which considerably complicates things for partitioned tables).
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_presto.png").default}
alt="hudi_query_perf_presto.png" />
-</figure>
+If you're interested in more details on how Data Skipping works internally, watch out for a blog post coming on this soon!
+
+To unlock the power of Data Skipping, you will need to:
+
+1. Enable the Metadata Table along with the Column Stats Index on the _write path_ (see [Async Meta Indexing](/docs/async_meta_indexing)).
+2. Enable Data Skipping in your queries.
+
+To enable the Metadata Table along with the Column Stats Index on the write path, make sure
+the following configurations are set to `true`:
+
+ - `hoodie.metadata.enable` (to enable Metadata Table on the write path,
enabled by default)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
being populated on the write path, disabled by default)
+
+:::note
+If you're planning on enabling the Column Stats Index for an already existing table, please check out the [Async Meta Indexing](/docs/async_meta_indexing) guide
+on how to build Metadata Table indices (such as the Column Stats Index) for existing tables.
+:::
+
+
+To enable Data Skipping in your queries, make sure to set the following properties to `true` (on the read path):
+
+ - `hoodie.enable.data.skipping` (to enable Data Skipping)
+ - `hoodie.metadata.enable` (to enable Metadata Table use on the read path)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
use on the read path)