bhasudha commented on a change in pull request #4010:
URL: https://github.com/apache/hudi/pull/4010#discussion_r754706629
##########
File path: website/docs/compaction.md
##########
@@ -1,33 +1,26 @@
---
title: Compaction
-summary: "In this page, we describe async compaction in Hudi."
toc: true
last_modified_at:
---
-For Merge-On-Read table, data is stored using a combination of columnar (e.g
parquet) + row based (e.g avro) file formats.
-Updates are logged to delta files & later compacted to produce new versions of
columnar files synchronously or
-asynchronously. One of the main motivations behind Merge-On-Read is to reduce
data latency when ingesting records.
-Hence, it makes sense to run compaction asynchronously without blocking
ingestion.
-
+Compaction is executed asynchronously with Hudi by default.
## Async Compaction
-
Async Compaction is performed in 2 steps:
1. ***Compaction Scheduling***: This is done by the ingestion job. In this
step, Hudi scans the partitions and selects **file
slices** to be compacted. A compaction plan is finally written to Hudi
timeline.
1. ***Compaction Execution***: A separate process reads the compaction plan
and performs compaction of file slices.
+## Scheduling Async Compaction
-## Deployment Models
-
-There are few ways by which we can execute compactions asynchronously.
+There are few ways by which we can schedule compactions to the Hudi timeline
to be executed later asynchronously.
-### Spark Structured Streaming
+### Schedule compaction with Spark Structured Streaming
Review comment:
@kywe665 I feel this is a bit confusing. Both Spark Structured Streaming
and DeltaStreamer continuous mode allow you to run async compactions
(scheduled and executed internally). From the user's perspective, there is
no need to schedule and execute compaction separately from the CLI or a
compactor script later when using DeltaStreamer or Spark Streaming. The CLI
and compactor scripts are other utilities for running compactions
asynchronously.
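
To illustrate the point for the Spark Structured Streaming case, here is a minimal sketch of the writer options involved. It assumes the standard Hudi config keys `hoodie.datasource.write.table.type`, `hoodie.datasource.compaction.async.enable` and `hoodie.compact.inline.max.delta.commits`. The helper name `async_compaction_options`, the table name `trips` and the default of 10 delta commits are placeholders, not taken from the PR.

```python
def async_compaction_options(table_name, max_delta_commits=10):
    """Build Hudi writer options for a Merge-On-Read table with async compaction.

    Hypothetical helper for illustration only; it just assembles a dict of
    config keys and does not talk to Spark or Hudi.
    """
    return {
        "hoodie.table.name": table_name,
        # Async compaction only applies to Merge-On-Read tables.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Let the streaming writer schedule and execute compaction internally.
        "hoodie.datasource.compaction.async.enable": "true",
        # Assumed trigger: schedule a compaction after this many delta commits.
        "hoodie.compact.inline.max.delta.commits": str(max_delta_commits),
    }


# "trips" is a placeholder table name.
options = async_compaction_options("trips")
```

In a PySpark streaming job these options would typically be passed with `df.writeStream.format("hudi").options(**options)`, with no separate CLI or compactor step needed afterwards.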