codope commented on code in PR #7985: URL: https://github.com/apache/hudi/pull/7985#discussion_r1109826709
##########
website/docs/clustering.md:
##########
@@ -10,6 +10,17 @@ last_modified_at:
 
 Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing. In a data lake/warehouse, one of the key trade-offs is between ingestion speed and query performance. Data ingestion typically prefers small files to improve parallelism and make data available to queries as soon as possible. However, query performance degrades poorly with a lot of small files. Also, during ingestion, data is typically co-located based on arrival time. However, the query engines perform better when the data frequently queried is co-located together. In most architectures each of these systems tend to add optimizations independently to improve performance which hits limitations due to un-optimized data layouts. This doc introduces a new kind of table service called clustering [[RFC-19]](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) to reorganize data for improved query performance without compromising on ingestion speed.
 <!--truncate-->
 
+## How is compaction different from clustering?
+
+Hudi is modeled like a log-structured storage engine with multiple versions of the data.
+Particularly, [Merge-On-Read](/docs/table_types#merge-on-read-table)
+tables in Hudi store data using a combination of base file in columnar format and row-based delta logs that contain
+updates. Compaction is a way to merge the delta logs with base files to produce the latest file slices with the most
+recent snapshot of data. Compaction helps to keep the query performance in check (larger delta log files would incur
+longer merge times on query side). On the other hand, clustering is a data layout optimization technique. One can stitch
+together small files into larger files using clustering. Additionally, data can be clustered by sort key so that queries
+can take advantage of data locality.

Review Comment:
   I don't think we need to mention that explicitly.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
