[GitHub] [hudi] bhasudha commented on a change in pull request #4010: [HUDI-2770] - Docs for (HUDI-2737) - Use earliest instant for async compaction and clustering

GitBox Mon, 22 Nov 2021 15:21:37 -0800


bhasudha commented on a change in pull request #4010:
URL: https://github.com/apache/hudi/pull/4010#discussion_r754706629




##########
File path: website/docs/compaction.md
##########
@@ -1,33 +1,26 @@
 ---
 title: Compaction
-summary: "In this page, we describe async compaction in Hudi."
 toc: true
 last_modified_at:
 ---
 
-For Merge-On-Read table, data is stored using a combination of columnar (e.g parquet) + row based (e.g avro) file formats.
-Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or
-asynchronously. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records.
-Hence, it makes sense to run compaction asynchronously without blocking ingestion.
-
+Compaction is executed asynchronously with Hudi by default.
 
 ## Async Compaction
-
 Async Compaction is performed in 2 steps:
 
 1. ***Compaction Scheduling***: This is done by the ingestion job. In this step, Hudi scans the partitions and selects **file
     slices** to be compacted. A compaction plan is finally written to Hudi timeline.
 1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices.
 
+## Scheduling Async Compaction
 
-## Deployment Models
-
-There are few ways by which we can execute compactions asynchronously.
+There are few ways by which we can schedule compactions to the Hudi timeline to be executed later asynchronously.
 
-### Spark Structured Streaming
+### Schedule compaction with Spark Structured Streaming

Review comment:
       @kywe665 I feel this is a bit confusing. Both Spark Structured Streaming
and DeltaStreamer continuous mode allow you to run async compactions
(scheduled and executed internally). From a user's perspective, they don't
need to schedule and execute compaction separately from the CLI or a compactor
script later when they are using DeltaStreamer or Spark Streaming. The CLI and
compactor scripts are other utilities for running compactions asynchronously.
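
To illustrate the point about Spark Structured Streaming, here is a minimal sketch of writer options under which compaction would be scheduled and executed inside the streaming job. The option keys are ones I believe Hudi's Spark datasource accepts, but they should be checked against the configuration reference for the release in use. The helper name `async_compaction_options` and the table name `example_table` are hypothetical and only for illustration.

```python
# Sketch only: writer options for a Merge-On-Read table where compaction is
# meant to be scheduled and executed by the streaming job itself.
# The helper name and the table name are hypothetical.
def async_compaction_options(table_name: str) -> dict:
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Assumed key for enabling async compaction in the streaming sink.
        "hoodie.datasource.compaction.async.enable": "true",
    }


options = async_compaction_options("example_table")
for key, value in sorted(options.items()):
    print(f"{key}={value}")

# With a Structured Streaming DataFrame `df` (not defined here), these options
# would typically be passed along the lines of:
#   df.writeStream.format("hudi").options(**options)...
```

With options like these, no separate CLI or compactor-script step would be expected from the user, which is why presenting this section as "scheduling" reads as confusing.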




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
