This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 330f2aa05d [HUDI-3930][Docs] Adding documentation around Data Skipping
(#5440)
330f2aa05d is described below
commit 330f2aa05d40312d764d5331d4ce92c1315f3914
Author: Alexey Kudinkin <[email protected]>
AuthorDate: Sun May 1 02:54:02 2022 -0700
[HUDI-3930][Docs] Adding documentation around Data Skipping (#5440)
Co-authored-by: Raymond Xu <[email protected]>
---
website/docs/performance.md | 77 +++++++++++++++-------
website/releases/release-0.11.0.md | 2 +
.../versioned_docs/version-0.11.0/performance.md | 77 +++++++++++++++-------
3 files changed, 108 insertions(+), 48 deletions(-)
diff --git a/website/docs/performance.md b/website/docs/performance.md
index db78a7f25b..162b7cc85e 100644
--- a/website/docs/performance.md
+++ b/website/docs/performance.md
@@ -23,12 +23,14 @@ Here are some ways to efficiently manage the storage of
your Hudi tables.
once created cannot be deleted, but simply expanded as explained before.
- For workloads with heavy updates, the [merge-on-read
table](/docs/concepts#merge-on-read-table) provides a nice mechanism for
ingesting quickly into smaller files and then later merging them into larger
base files via compaction.
-## Performance Gains
+## Performance Optimizations
In this section, we go over some real world performance numbers for Hudi
upserts, incremental pull and compare them against
-the conventional alternatives for achieving these tasks.
+the conventional alternatives for achieving these tasks.
-### Upserts
+### Write Path
+
+#### Upserts
Following shows the speed up obtained for NoSQL database ingestion, from
incrementally upserting on a Hudi table on the copy-on-write storage,
on 5 tables ranging from small to huge (as opposed to bulk loading the tables)
@@ -47,38 +49,65 @@ significant savings on the overall compute cost.
Hudi upserts have been stress tested upto 4TB in a single commit across the t1
table.
See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for
some tuning tips.
-### Indexing
+#### Indexing
-In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
-it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
+In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
+it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
the file group id it belongs to. By default, Hudi uses a built in index that
uses file ranges and bloom filters to accomplish this, with
-upto 10x speed up over a spark join to do the same.
+upto 10x speed up over a spark join to do the same.
Hudi provides best indexing performance when you model the recordKey to be
monotonically increasing (e.g timestamp prefix), leading to range pruning
filtering
out a lot of files for comparison. Even for UUID based keys, there are [known
techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)
to achieve this.
-For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
-**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
+For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
+**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
a **80-100% speedup**.
-### Snapshot Queries
-The major design goal for snapshot queries is to achieve the latency reduction
& efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi
tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
-**Hive**
+#### Data Skipping
+
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages file-level metadata to very effectively prune the search space,
+avoiding reads (even of the footers) of files that are known, based on that metadata, to contain only data that _does not match_ the query's filters.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_hive.png").default}
alt="hudi_query_perf_hive.png" />
-</figure>
+Data Skipping relies on the Metadata Table's Column Stats Index, which holds column-level statistics (such as min value, max value, and count of null values)
+for every column of every file in the Hudi table. Instead of enumerating every file in the table and reading its corresponding metadata
+(for example, Parquet footers) to analyze whether it could contain any data matching the query's filters, Hudi simply queries the Column Stats Index
+in the Metadata Table (which is itself a Hudi table) and, within seconds even for TB-scale tables with tens of thousands of files, obtains the list
+of _all the files that might potentially contain data_ matching the query's filters, with the crucial property that files that can be ruled out as not
+containing such data (based on their column-level statistics) are stripped out.
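The pruning test described above can be sketched as a toy example (hypothetical Python, not Hudi's actual implementation): a file is kept as a candidate only when its [min, max] range for the filtered column can overlap the query's predicate range, so files that cannot contain matching rows are never read.

```python
# Toy illustration of column-stats-based file pruning (NOT Hudi's actual code).
# Each file carries (min, max) stats per column; a file is skipped when the
# query's predicate range cannot overlap the file's [min, max] range.

def prune_files(file_stats, column, lo, hi):
    """Return the files that MIGHT contain rows with lo <= column <= hi."""
    candidates = []
    for file_name, stats in file_stats.items():
        col_min, col_max = stats[column]
        # Overlap test: only skip a file whose range lies entirely outside the
        # query range -- this never rules out a file that could match.
        if col_max >= lo and col_min <= hi:
            candidates.append(file_name)
    return candidates

# Hypothetical per-file statistics, as a Column Stats Index would carry them.
file_stats = {
    "f1.parquet": {"ts": (0, 100)},
    "f2.parquet": {"ts": (101, 200)},
    "f3.parquet": {"ts": (201, 300)},
}

# Query filter: WHERE ts BETWEEN 150 AND 250 -> only f2 and f3 can match.
print(prune_files(file_stats, "ts", 150, 250))  # ['f2.parquet', 'f3.parquet']
```

Note the asymmetry: pruning is conservative, so a surviving file may still contain no matching rows, but a skipped file is guaranteed not to.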
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using physical partitioning, where records are laid out on disk
+in a folder structure based on some column's value or a derivative of it (clumping related records together). Instead of an on-disk
+folder structure, however, Data Skipping leverages an index maintaining a mapping "file → column statistics" for all of the columns persisted
+within that file.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_spark.png").default}
alt="hudi_query_perf_spark.png" />
-</figure>
+For very large tables (1TB+, tens of thousands of files), Data Skipping can:
-**Presto**
+1. Substantially improve query execution runtime (by avoiding fruitless compute churn), in excess of **10x** compared to the same query on the same dataset without Data Skipping enabled.
+2. Help avoid hitting cloud storage throttling limits (for issuing too many requests; for example, AWS limits the number of requests per second that can be issued per object prefix, which considerably complicates things for partitioned tables).
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_presto.png").default}
alt="hudi_query_perf_presto.png" />
-</figure>
+If you're interested in more details on how Data Skipping works internally, watch out for a blog post coming on this soon!
+
+To unlock the power of Data Skipping, you will need to:
+
+1. Enable the Metadata Table along with the Column Stats Index on the _write path_ (see [Async Meta Indexing](/docs/async_meta_indexing)).
+2. Enable Data Skipping in your queries.
+
+To enable the Metadata Table along with the Column Stats Index on the write path, make sure
+the following configurations are set to `true`:
+
+ - `hoodie.metadata.enable` (to enable Metadata Table on the write path,
enabled by default)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
being populated on the write path, disabled by default)
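As a minimal sketch (not an official snippet), these write-path options could be assembled and passed to a Spark writer; the DataFrame, base path, and surrounding Spark calls are hypothetical, while the option keys are the ones listed above:

```python
# Hypothetical sketch: write-path options enabling the Metadata Table and the
# Column Stats Index. The Spark write call is shown in a comment because it
# requires a running Spark session.

write_opts = {
    # Enables the Metadata Table on the write path (enabled by default).
    "hoodie.metadata.enable": "true",
    # Populates the Column Stats Index on the write path (disabled by default).
    "hoodie.metadata.index.column.stats.enable": "true",
}

# In a Spark job this would be applied roughly as:
#   df.write.format("hudi").options(**write_opts).mode("append").save(base_path)
```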
+
+:::note
+If you're planning on enabling the Column Stats Index for an already existing table, please check out the [Async Meta Indexing](/docs/async_meta_indexing) guide
+on how to build Metadata Table indices (such as the Column Stats Index) for existing tables.
+:::
+
+
+To enable Data Skipping in your queries, make sure to set the following properties to `true` (on the read path):
+
+ - `hoodie.enable.data.skipping` (to enable Data Skipping)
+ - `hoodie.metadata.enable` (to enable Metadata Table use on the read path)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
use on the read path)
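The read-path counterpart can be sketched the same way (again hypothetical: the Spark read call and the filter are illustrative, while the option keys are the ones listed above):

```python
# Hypothetical sketch: read-path options for Data Skipping. The Spark read
# call is shown in a comment because it requires a running Spark session.

read_opts = {
    "hoodie.enable.data.skipping": "true",                # turn on Data Skipping
    "hoodie.metadata.enable": "true",                     # use the Metadata Table on reads
    "hoodie.metadata.index.column.stats.enable": "true",  # use the Column Stats Index on reads
}

# Roughly:
#   spark.read.format("hudi").options(**read_opts).load(base_path) \
#        .filter("ts BETWEEN 150 AND 250")
```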
diff --git a/website/releases/release-0.11.0.md
b/website/releases/release-0.11.0.md
index 2916b12b86..0662eeddee 100644
--- a/website/releases/release-0.11.0.md
+++ b/website/releases/release-0.11.0.md
@@ -48,6 +48,8 @@ following: `date_format(ts, "MM/dd/yyyy" ) < "04/01/2022"`.
*Note: Currently Data Skipping is only supported in COW tables and MOR tables
in read-optimized mode. The work of full
support for MOR tables is tracked in
[HUDI-3866](https://issues.apache.org/jira/browse/HUDI-3866)*
+*Refer to the [performance](/docs/performance#read-path) guide for more info.*
+
### Async Indexer
In 0.11.0, we added a new asynchronous service for indexing to our rich set of
table services. It allows users to create
diff --git a/website/versioned_docs/version-0.11.0/performance.md
b/website/versioned_docs/version-0.11.0/performance.md
index db78a7f25b..162b7cc85e 100644
--- a/website/versioned_docs/version-0.11.0/performance.md
+++ b/website/versioned_docs/version-0.11.0/performance.md
@@ -23,12 +23,14 @@ Here are some ways to efficiently manage the storage of
your Hudi tables.
once created cannot be deleted, but simply expanded as explained before.
- For workloads with heavy updates, the [merge-on-read
table](/docs/concepts#merge-on-read-table) provides a nice mechanism for
ingesting quickly into smaller files and then later merging them into larger
base files via compaction.
-## Performance Gains
+## Performance Optimizations
In this section, we go over some real world performance numbers for Hudi
upserts, incremental pull and compare them against
-the conventional alternatives for achieving these tasks.
+the conventional alternatives for achieving these tasks.
-### Upserts
+### Write Path
+
+#### Upserts
Following shows the speed up obtained for NoSQL database ingestion, from
incrementally upserting on a Hudi table on the copy-on-write storage,
on 5 tables ranging from small to huge (as opposed to bulk loading the tables)
@@ -47,38 +49,65 @@ significant savings on the overall compute cost.
Hudi upserts have been stress tested upto 4TB in a single commit across the t1
table.
See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for
some tuning tips.
-### Indexing
+#### Indexing
-In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
-it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
+In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
+it belongs to). In order to speed this operation, Hudi employs a pluggable
index mechanism that stores a mapping between recordKey and
the file group id it belongs to. By default, Hudi uses a built in index that
uses file ranges and bloom filters to accomplish this, with
-upto 10x speed up over a spark join to do the same.
+upto 10x speed up over a spark join to do the same.
Hudi provides best indexing performance when you model the recordKey to be
monotonically increasing (e.g timestamp prefix), leading to range pruning
filtering
out a lot of files for comparison. Even for UUID based keys, there are [known
techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)
to achieve this.
-For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
-**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
+For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a
event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index
achieves a
+**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
a **80-100% speedup**.
-### Snapshot Queries
-The major design goal for snapshot queries is to achieve the latency reduction
& efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi
tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
-**Hive**
+#### Data Skipping
+
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages file-level metadata to very effectively prune the search space,
+avoiding reads (even of the footers) of files that are known, based on that metadata, to contain only data that _does not match_ the query's filters.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_hive.png").default}
alt="hudi_query_perf_hive.png" />
-</figure>
+Data Skipping relies on the Metadata Table's Column Stats Index, which holds column-level statistics (such as min value, max value, and count of null values)
+for every column of every file in the Hudi table. Instead of enumerating every file in the table and reading its corresponding metadata
+(for example, Parquet footers) to analyze whether it could contain any data matching the query's filters, Hudi simply queries the Column Stats Index
+in the Metadata Table (which is itself a Hudi table) and, within seconds even for TB-scale tables with tens of thousands of files, obtains the list
+of _all the files that might potentially contain data_ matching the query's filters, with the crucial property that files that can be ruled out as not
+containing such data (based on their column-level statistics) are stripped out.
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using physical partitioning, where records are laid out on disk
+in a folder structure based on some column's value or a derivative of it (clumping related records together). Instead of an on-disk
+folder structure, however, Data Skipping leverages an index maintaining a mapping "file → column statistics" for all of the columns persisted
+within that file.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_spark.png").default}
alt="hudi_query_perf_spark.png" />
-</figure>
+For very large tables (1TB+, tens of thousands of files), Data Skipping can:
-**Presto**
+1. Substantially improve query execution runtime (by avoiding fruitless compute churn), in excess of **10x** compared to the same query on the same dataset without Data Skipping enabled.
+2. Help avoid hitting cloud storage throttling limits (for issuing too many requests; for example, AWS limits the number of requests per second that can be issued per object prefix, which considerably complicates things for partitioned tables).
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_query_perf_presto.png").default}
alt="hudi_query_perf_presto.png" />
-</figure>
+If you're interested in more details on how Data Skipping works internally, watch out for a blog post coming on this soon!
+
+To unlock the power of Data Skipping, you will need to:
+
+1. Enable the Metadata Table along with the Column Stats Index on the _write path_ (see [Async Meta Indexing](/docs/async_meta_indexing)).
+2. Enable Data Skipping in your queries.
+
+To enable the Metadata Table along with the Column Stats Index on the write path, make sure
+the following configurations are set to `true`:
+
+ - `hoodie.metadata.enable` (to enable Metadata Table on the write path,
enabled by default)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
being populated on the write path, disabled by default)
+
+:::note
+If you're planning on enabling the Column Stats Index for an already existing table, please check out the [Async Meta Indexing](/docs/async_meta_indexing) guide
+on how to build Metadata Table indices (such as the Column Stats Index) for existing tables.
+:::
+
+
+To enable Data Skipping in your queries, make sure to set the following properties to `true` (on the read path):
+
+ - `hoodie.enable.data.skipping` (to enable Data Skipping)
+ - `hoodie.metadata.enable` (to enable Metadata Table use on the read path)
+ - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index
use on the read path)