kywe665 commented on a change in pull request #4010:
URL: https://github.com/apache/hudi/pull/4010#discussion_r754745374
##########
File path: website/docs/compaction.md
##########
@@ -1,33 +1,26 @@
---
title: Compaction
-summary: "In this page, we describe async compaction in Hudi."
toc: true
last_modified_at:
---
-For Merge-On-Read table, data is stored using a combination of columnar (e.g parquet) + row based (e.g avro) file formats.
-Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or
-asynchronously. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records.
-Hence, it makes sense to run compaction asynchronously without blocking ingestion.
-
+Compaction is executed asynchronously with Hudi by default.
## Async Compaction
-
Async Compaction is performed in 2 steps:
1. ***Compaction Scheduling***: This is done by the ingestion job. In this step, Hudi scans the partitions and selects **file
slices** to be compacted. A compaction plan is finally written to Hudi timeline.
1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices.
+## Scheduling Async Compaction
-## Deployment Models
-
-There are few ways by which we can execute compactions asynchronously.
+There are few ways by which we can schedule compactions to the Hudi timeline to be executed later asynchronously.
-### Spark Structured Streaming
+### Schedule compaction with Spark Structured Streaming
Review comment:
Nice catch. I reverted these changes.