This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 207c1650696 [MINOR] Add specs for log compaction and indexer (#10106)
207c1650696 is described below

commit 207c1650696264c7f3bcac328c2ce00a50e84cfd
Author: Sagar Sumit <[email protected]>
AuthorDate: Thu Nov 16 02:49:20 2023 +0530

    [MINOR] Add specs for log compaction and indexer (#10106)
---
 website/src/pages/tech-specs-1point0.md | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/website/src/pages/tech-specs-1point0.md 
b/website/src/pages/tech-specs-1point0.md
index 85c52025cd2..082de711219 100644
--- a/website/src/pages/tech-specs-1point0.md
+++ b/website/src/pages/tech-specs-1point0.md
@@ -561,7 +561,12 @@ Compaction is the process that efficiently updates a file 
slice (base and log fi
 
 ### Log Compaction
 
-\[WIP\] See 
[RFC-48](https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md) for 
now.
+Log compaction is a minor compaction operation that stitches log files together into a single larger log file, thus reducing write amplification.
+It introduces a new action on the Hudi timeline called `logcompaction`.
+The log-compacted file is written to the same file group as the log files being compacted. Additionally, the header of a
+log-compacted log block contains `COMPACTED_BLOCK_TIMES`, which lets the log file reader skip log blocks that have
+already been compacted, thus reducing read amplification as well.
+See [RFC-48](https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md) for more details.
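The skipping behavior described above can be sketched as follows. This is an illustrative model, not Hudi's actual reader API: class and method names are assumptions, and only the role of the `COMPACTED_BLOCK_TIMES` header comes from the spec.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: a reader skips any log block whose instant time is
// listed in some other block's COMPACTED_BLOCK_TIMES header.
class LogBlock {
    final String instantTime;              // instant that wrote this block
    final Set<String> compactedBlockTimes; // COMPACTED_BLOCK_TIMES header; empty if not a compacted block

    LogBlock(String instantTime, Set<String> compactedBlockTimes) {
        this.instantTime = instantTime;
        this.compactedBlockTimes = compactedBlockTimes;
    }
}

class LogScanner {
    // Collect every instant covered by a log-compacted block, then return
    // only the blocks whose instant is not covered.
    static List<LogBlock> blocksToRead(List<LogBlock> blocks) {
        Set<String> covered = new HashSet<>();
        for (LogBlock b : blocks) {
            covered.addAll(b.compactedBlockTimes);
        }
        List<LogBlock> result = new ArrayList<>();
        for (LogBlock b : blocks) {
            if (!covered.contains(b.instantTime)) {
                result.add(b);
            }
        }
        return result;
    }
}
```

With three blocks where the third is a log-compacted block covering the first two, only the third is replayed, which is the read-amplification saving the text describes.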
 
 ### Re-writing
 
@@ -597,8 +602,22 @@ Apache Hudi provides snapshot isolation between writers 
and readers by managing
 
 ### Indexing
 
-\[WIP\] See 
[RFC-45](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md) for 
now.
-
+Indexing is an asynchronous process that creates indexes on the table without blocking ingestion writers. Indexing is
+divided into two phases: scheduling and execution. During scheduling, the indexer takes a lock for a short duration and
+generates an indexing plan for the data files based on a snapshot. The index plan is serialized as Avro bytes and stored
+in the `[instant].indexing.requested` file in the timeline.
+Here is the schema for the index plan:
+
+| Field               | Description |
+|---------------------|-------------|
+| version             | Index plan version. The current version is 1. It is updated whenever the index plan format changes. |
+| indexPartitionInfos | An array of `HoodieIndexPartitionInfo`s. Each element consists of the metadata partition path, the base instant off of which the indexer started, and an optional map of extra metadata. |
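As a rough in-memory mirror of the schema above (the real plan is an Avro record; the field names follow the table, everything else here is illustrative):

```java
import java.util.List;
import java.util.Map;

// Hypothetical mirror of one indexPartitionInfos element: the metadata
// partition path, the base instant off of which the indexer started, and
// an optional map of extra metadata.
class HoodieIndexPartitionInfo {
    final String metadataPartitionPath;
    final String baseInstant;
    final Map<String, String> extraMetadata;

    HoodieIndexPartitionInfo(String metadataPartitionPath, String baseInstant,
                             Map<String, String> extraMetadata) {
        this.metadataPartitionPath = metadataPartitionPath;
        this.baseInstant = baseInstant;
        this.extraMetadata = extraMetadata;
    }
}

// Hypothetical mirror of the plan itself: a version plus the array of
// partition infos. The version is bumped when the plan format changes.
class HoodieIndexPlan {
    static final int CURRENT_VERSION = 1;
    final int version;
    final List<HoodieIndexPartitionInfo> indexPartitionInfos;

    HoodieIndexPlan(List<HoodieIndexPartitionInfo> indexPartitionInfos) {
        this.version = CURRENT_VERSION;
        this.indexPartitionInfos = indexPartitionInfos;
    }
}
```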
+
+During execution, the indexer executes the plan, writing the index base file in the metadata table. Any ongoing commit
+updates the index in log files under the same file group. After writing the base file, the indexer checks all commit
+instants completed after the base instant `t` to ensure each of them added entries per its indexing plan; otherwise it
+simply aborts gracefully. Finally, when indexing is complete, the indexer writes `[instant].indexing` to the timeline.
+The indexer only takes a lock while adding events to the timeline, not while writing index files.
+See [RFC-45](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md) for more details.
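The post-write catch-up check described above can be sketched as follows. Names are assumptions, not Hudi's API; the only facts taken from the spec are that commits completed after the base instant `t` must have added index entries, and that the indexer otherwise aborts gracefully.

```java
import java.util.List;
import java.util.Set;

// Sketch of the indexer's catch-up check before it writes the
// [instant].indexing completion file to the timeline.
class IndexCatchUpCheck {
    // Instant times are timestamp strings, so lexicographic comparison
    // gives timeline order in this sketch.
    static boolean canCompleteIndexing(String baseInstant,
                                       List<String> completedCommits,
                                       Set<String> commitsThatUpdatedIndex) {
        for (String commit : completedCommits) {
            if (commit.compareTo(baseInstant) > 0
                    && !commitsThatUpdatedIndex.contains(commit)) {
                return false; // a later commit missed the index: abort gracefully
            }
        }
        return true; // safe to write [instant].indexing to the timeline
    }
}
```

A commit completed before the base instant is already covered by the base file, so only commits after `t` are checked.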
 
 
 ## Compatibility
