[GitHub] [hudi] nsivabalan commented on a diff in pull request #5440: [HUDI-3930][Docs] Adding documentation around Data Skipping

GitBox Thu, 28 Apr 2022 18:14:59 -0700


nsivabalan commented on code in PR #5440:
URL: https://github.com/apache/hudi/pull/5440#discussion_r861409033



##########
website/docs/performance.md:
##########
@@ -60,25 +62,48 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 
 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
 
-### Snapshot Queries
 
-The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
 
-**Hive**
+#### Data Skipping
+ 
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages files metadata to very effectively prune the search space, by
+avoiding reading (even footers of) the files that are known (based on the metadata) to only contain the data that _does not match_ the query's filters.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_hive.png").default} alt="hudi_query_perf_hive.png"  />
-</figure>
+Data Skipping is leveraging Metadata Table's Column Stats Index bearing column-level statistics (such as min-value, max-value, count of null-values in the column, etc)

Review Comment:
   I guess this is available only for COW tables, right? If yes, should we call
   that out as well?
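
For readers of this thread, the pruning idea described in the quoted doc text can be illustrated with a minimal sketch. This is not Hudi's implementation or API: `ColumnStats`, `may_contain`, `prune_files`, and the file names are hypothetical, and the sketch assumes a single integer column with an inclusive range filter. It only shows how per-file min/max statistics let a reader rule out files without opening them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file statistics for one column (not a Hudi class)."""
    file_name: str
    min_value: int
    max_value: int
    null_count: int


def may_contain(stats: ColumnStats, lower: int, upper: int) -> bool:
    """True if the file's [min, max] range overlaps the filter [lower, upper]."""
    return stats.max_value >= lower and stats.min_value <= upper


def prune_files(index, lower, upper):
    """Keep only files whose statistics cannot rule out a match."""
    return [s.file_name for s in index if may_contain(s, lower, upper)]


# Illustrative statistics; the values are made up for this sketch.
index = [
    ColumnStats("file-a.parquet", 0, 99, 0),
    ColumnStats("file-b.parquet", 100, 199, 3),
    ColumnStats("file-c.parquet", 200, 299, 0),
]

# Filter: column BETWEEN 150 AND 180, so only file-b needs to be read.
print(prune_files(index, 150, 180))
```

Files that survive pruning may still contain no matching rows; the statistics only prove that the skipped files cannot match.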



