[GitHub] [hudi] codope commented on a diff in pull request #7985: [DOCS] Update clustering docs

via GitHub Fri, 17 Feb 2023 05:41:54 -0800


codope commented on code in PR #7985:
URL: https://github.com/apache/hudi/pull/7985#discussion_r1109826709



##########
website/docs/clustering.md:
##########
@@ -10,6 +10,17 @@ last_modified_at:
 Apache Hudi brings stream processing to big data, providing fresh data while 
being an order of magnitude efficient over traditional batch processing. In a 
data lake/warehouse, one of the key trade-offs is between ingestion speed and 
query performance. Data ingestion typically prefers small files to improve 
parallelism and make data available to queries as soon as possible. However, 
query performance degrades poorly with a lot of small files. Also, during 
ingestion, data is typically co-located based on arrival time. However, the 
query engines perform better when the data frequently queried is co-located 
together. In most architectures each of these systems tend to add optimizations 
independently to improve performance which hits limitations due to un-optimized 
data layouts. This doc introduces a new kind of table service called clustering 
[[RFC-19]](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance)
 to reorganize data for improved query performance without compromising on ingestion speed.
 <!--truncate-->
 
+## How is compaction different from clustering?
+
+Hudi is modeled like a log-structured storage engine with multiple versions of 
the data.
+Particularly, [Merge-On-Read](/docs/table_types#merge-on-read-table)
+tables in Hudi store data using a combination of base file in columnar format 
and row-based delta logs that contain
+updates. Compaction is a way to merge the delta logs with base files to 
produce the latest file slices with the most
+recent snapshot of data. Compaction helps to keep the query performance in 
check (larger delta log files would incur
+longer merge times on query side). On the other hand, clustering is a data 
layout optimization technique. One can stitch
+together small files into larger files using clustering. Additionally, data 
can be clustered by sort key so that queries
+can take advantage of data locality.

Review Comment:
   I don't think we need to mention that explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

