prasannarajaperumal commented on code in PR #6268:
URL: https://github.com/apache/hudi/pull/6268#discussion_r942176922


##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,371 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document specifies the Hudi Storage Format, which transforms immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+The Hudi Storage Format enables the following features over very large collections of files/objects:
+
+- streaming primitives such as incremental merges and change streams
+- database primitives such as tables, transactions, mutability, indexes and query performance optimizations
+
+Apache Hudi is an open-source data lake platform built on top of the Hudi Storage Format; it unlocks the following features:
+
+- **Unified computation model** - a unified way to combine large batch-style operations and frequent near-real-time streaming operations over a single dataset
+- **Self-optimizing storage** - automatically handles table storage maintenance such as compaction, clustering and vacuuming, asynchronously and without blocking actual data changes
+- **Cloud-native database** - abstracts the table/schema from the actual storage and keeps metadata and indexes up to date, unlocking multi-fold read and write performance optimizations
+- **Engine neutrality** - designed to be neutral, with no preferred computation engine. Apache Hudi manages metadata and provides common abstractions and pluggable interfaces for most common computation engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into a directory structure under the base path (the root directory of the Hudi table). The directory structure is based on coarse-grained partitioning values set for the dataset. Non-partitioned datasets store all data files directly under the base path. The Hudi storage format reserves a special *.hoodie* directory under the base path for transaction logs and metadata.
+
+```
+/data/hudi_trips/                                      <== BASE PATH
+├── .hoodie/                                           <== META BASE PATH
+│   └── metadata/ 
+├── americas/
+│   ├── brazil/
+│   │   └── sao_paulo/      <== PARTITIONED DIRECTORY 
+│   │       ├── <data_files>
+│   └── united_states/
+│       └── san_francisco/
+│           ├── <data_files>
+└── asia/
+    └── india/
+        └── chennai/
+            ├── <data_files>
+```
+
+The Hudi storage format offers two table types with different trade-offs between ingest and query performance; data files are stored differently depending on the chosen table type.
+
+| Table Type    | Trade-off                                                    |
+| ------------- | ------------------------------------------------------------ |
+| Copy-on-write | Optimized for read performance; ideal for slowly changing datasets |
+| Merge-on-read | Optimized to balance write and read performance; ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into a key-value model. Every row is uniquely identified by a row key. To write a row into a Hudi dataset, each row must specify the following user fields:
+
+| User fields                 | Description                                                  |
+| --------------------------- | ------------------------------------------------------------ |
+| Partitioning key [Optional] | The value of this field defines the directory hierarchy within the table base path, providing hierarchical isolation for managing data and related metadata |
+| Row key(s)                  | Record keys uniquely identify a record/row within each partition if partitioning is enabled |
+| Ordering field(s)           | Hudi guarantees the uniqueness constraint of the row key; the conflict resolution configuration determines how to disambiguate when multiple records with the same key are merged into the dataset. The resolution logic can be based on an ordering field, or it can be custom and specific to the dataset. To ensure consistent behaviour when dealing with duplicate records, the resolution logic should be commutative and idempotent |
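The commutativity and idempotence requirement above can be illustrated with a minimal sketch; the names `TripRecord` and `resolve` are hypothetical and not part of any Hudi API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TripRecord:
    row_key: str
    ordering_value: int  # e.g. an event timestamp used as the ordering field
    payload: str

def resolve(a: TripRecord, b: TripRecord) -> TripRecord:
    # Keep the record with the larger ordering value; the deterministic
    # tie-break on the payload keeps the merge commutative and idempotent
    # for records sharing the same row key.
    return max(a, b, key=lambda r: (r.ordering_value, r.payload))
```

Because `resolve` depends only on the records' contents, applying it in any order, or applying it twice to the same record, yields the same result, which is what makes merging duplicates safe.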
+
+**Hudi metadata fields**
+
+The Hudi format stores each row's user fields together with transactional metadata fields. These fields are encoded in the data-file format and are available in the table schema.
+
+| Hudi meta-fields             | Description                                                  |
+| ---------------------------- | ------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates an entry in the transaction timeline, identified by its commit time. This field matches the commit time of the timeline instant that created this record. See the Hudi transactions section below for how it is populated |
+| _hoodie_record_key           | Unique record key identifying the row within the partition |
+| _hoodie_partition_path       | Partition path under which the data is organized |
+| _hoodie_file_name            | Name of the data file this record belongs to |
+| _hoodie_is_deleted           | Tombstone field denoting that the record key is deleted |
+
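As an illustration, a stored row carries the meta-fields above alongside the user fields; all values below are hypothetical:

```python
# Hypothetical example of a stored row: Hudi meta-fields plus user fields.
record = {
    # Hudi transactional meta-fields
    "_hoodie_commit_time": "20220809123045",
    "_hoodie_record_key": "trip_1234",
    "_hoodie_partition_path": "americas/brazil/sao_paulo",
    "_hoodie_file_name": "data_file_001.parquet",  # hypothetical file name
    "_hoodie_is_deleted": False,
    # User fields
    "trip_id": "trip_1234",
    "fare": 42.5,
}
```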
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-Version Concurrency Control (MVCC). Every transactional action on a Hudi table creates a new entry (instant) in the timeline. All transactional actions follow the state transitions below:
+
+* **requested** - The action has been requested on the timeline
+* **inflight** - The action has started running and is currently in flight
+* **completed** - The action has completed running
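The three states above form a simple forward-only progression, which can be sketched as a tiny transition table (illustrative only; these names are not a Hudi API):

```python
# requested -> inflight -> completed: the forward transitions of an
# instant's lifecycle, as described by the spec.
VALID_TRANSITIONS = {
    "requested": {"inflight"},
    "inflight": {"completed"},
    "completed": set(),
}

def can_transition(src: str, dst: str) -> bool:
    # A transition is valid only if the timeline's state machine allows it.
    return dst in VALID_TRANSITIONS.get(src, set())
```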
+
+All actions and their state transitions are registered with the timeline using an atomic put of a special meta-file inside the *.hoodie* directory. The underlying storage system is required to support atomic puts and read-after-write consistency. The meta-file name is structured as follows:
+
+```
+[Action timestamp].[Action type].[Action state] 
+```
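For illustration, a meta-file name following this scheme can be split back into its components; the helper and the file name below are hypothetical, not part of Hudi itself:

```python
def parse_instant(filename: str) -> tuple:
    # Split "[Action timestamp].[Action type].[Action state]"
    # into its three components.
    timestamp, action_type, state = filename.split(".")
    return timestamp, action_type, state
```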
+
+* **Action timestamp** - Monotonically increasing value denoting the strict ordering of actions in the timeline. It can be provided by an external token provider or derived from the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possible actions on the Hudi timeline.
+
+  | Action type   | Description                                                  |
+  | ------------- | ------------------------------------------------------------ |
+  | commit        | Denotes an **atomic write (inserts, updates and deletes)** of records in a table. A commit in Hudi atomically updates data, metadata and indexes; the guarantee is that either all or none of the changes within a commit are visible to readers |
+  | deltacommit   | A special version of `commit`, applicable only to Merge-on-read tables. Writes are accumulated and batched to improve write performance |
+  | rollback      | Denotes that the changes made by the corresponding commit/deltacommit were unsuccessful and hence rolled back, removing any partial files produced during such a write |
+  | savepoint     | A special marker ensuring a particular commit is not automatically cleaned. It helps restore the table to a point on the timeline in disaster/data recovery scenarios |
+  | restore       | Denotes that a particular savepoint was restored |
+  | clean         | Maintenance activity that cleans up versions of data files that will no longer be accessed |
+  | compaction    | Maintenance activity that optimizes storage for query performance. It applies the batched updates from `deltacommit` and re-optimizes data files |
+  | replacecommit | Maintenance activity that clusters data for better query performance. Unlike a `commit`, it does not change the logical state of the table: the table contents before and after are logically equivalent |

Review Comment:
   I was trying to highlight that a replace commit does not change the logical state of the table (no data change). 
   


