codope commented on code in PR #6268:
URL: https://github.com/apache/hudi/pull/6268#discussion_r937674006
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
Review Comment:
Should we mention what the base path, the partition values, and the metadata
table are in this example?
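For illustration, the sketch below (hypothetical code, not Hudi's actual API) shows how the base path, the partition values, and the reserved `.hoodie/metadata` directory from the tree above fit together:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch: illustrates how coarse-grained partition values map to
// the directory hierarchy shown in the example layout.
public class PartitionPathExample {
    // Root directory of the Hudi table.
    static final Path BASE_PATH = Paths.get("/data/hudi_trips");
    // Reserved directory holding the transaction log and the metadata table.
    static final Path HOODIE_DIR = BASE_PATH.resolve(".hoodie");

    // Join the partition values into the relative partition path under the base path.
    static Path partitionPath(List<String> partitionValues) {
        Path p = BASE_PATH;
        for (String value : partitionValues) {
            p = p.resolve(value);
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(partitionPath(List.of("americas", "brazil", "sao_paulo")));
        // -> /data/hudi_trips/americas/brazil/sao_paulo   (data files live here)
        System.out.println(HOODIE_DIR.resolve("metadata"));
        // -> /data/hudi_trips/.hoodie/metadata            (internal metadata table)
    }
}
```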
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
Review Comment:
Since it is optional, shall we mention it at the end?
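For illustration, here is a minimal sketch of extracting the row key and the optional partition path from a row; the field names and the `extractKey` shape are assumptions for this example, not Hudi's actual `KeyGenerator` API:

```java
import java.util.Map;

// Hypothetical sketch of key extraction for the user fields described above.
public class KeyExtractionExample {

    record HoodieKey(String recordKey, String partitionPath) {}

    // "uuid" is the row key field; "region"/"country"/"city" are the optional
    // partitioning key fields (all names are assumptions for this sketch).
    static HoodieKey extractKey(Map<String, Object> row) {
        String recordKey = String.valueOf(row.get("uuid"));
        String partitionPath = row.containsKey("region")
            ? row.get("region") + "/" + row.get("country") + "/" + row.get("city")
            : "";  // non-partitioned table: everything lives under the base path
        return new HoodieKey(recordKey, partitionPath);
    }

    public static void main(String[] args) {
        var row = Map.<String, Object>of(
            "uuid", "trip-0001", "region", "asia", "country", "india",
            "city", "chennai", "fare", 42.5);
        System.out.println(extractKey(row));
        // -> HoodieKey[recordKey=trip-0001, partitionPath=asia/india/chennai]
    }
}
```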
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-version Concurrency Control
(MVCC). Every transactional action on the Hudi table creates a new entry
(instant) in the timeline. All transactional actions follows the state
transition below
+
+* **requested** - Action is requested to start on the timeline
+* **inflight** - Action has started running and is currently in-flight
+* **completed** - Action has completed running
+
+All actions and the state transitions are registered with the timeline using
an atomic put of special meta-file inside the *.hoodie* directory. The
requirement from the underlying storage system is to support an atomic-put and
read-after-write consistency. The meta file structure is as follows
+
+```
+[Action timestamp].[Action type].[Action state]
+```
+
+* **Action timestamp** - Monotonically increasing value to denote strict
ordering of actions in the timeline. This could be provided by an external
token provider or rely on the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possition actions on
the hudi timeline.
+
+* | Action type | Description
|
+ | ------------- |
------------------------------------------------------------ |
+ | commit | Commit denotes an **atomic write (inserts, updates and
deletes)** of records in a table. A commit in Hudi is an atomic way of updating
data, metadata and indexes. The guarantee is that all or none the changes
within a commit will be visible to the readers |
+ | deltacommit | Special version of `commit` which is applicable only on a
Merge-on-Read storage engine. The writes are accumulated and batched to improve
write performance |
+ | rollback | Rollback denotes that the changes made by the
corresponding commit/delta commit was unsuccessful & hence rolled back,
removing any partial files produced during such a write |
+ | savepoint | Savepoint is a special marker to ensure a particular
commit is not automatically cleaned. It helps restore the table to a point on
the timeline, in case of disaster/data recovery scenarios |
+ | restore | Restore denotes that a particular Savepoint was restored
|
+ | clean | Maintenance activity that cleans up versions of data files
that no longer will be accessed |
+ | compaction | Maintenance to optimize the storage for query performance.
This action applies the batched up updates from `deltacommit` and re-optimizes
data files for query performance |
+ | replacecommit | Maintenance activity to cluster the data for better query
performance. This action is different from a `commit` in that the the table
state before and after are logically equivalent |
+ | indexing | [todo]
|
+ | schemacommit | [todo]
|
Review Comment:
This action denotes persistence of the merged schema under `.hoodie/schema`. It
only happens when the full schema evolution feature is enabled. Since that
feature is still experimental, it is disabled by default. This makes the hoodie
metadata act like a schema registry as well, so the full schema evolution
history is retained. In the future, the plan is to move this inside
`.hoodie/metadata`. @xiarixiaoyao can add more details.
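As an illustration of the schema-registry behavior described above, here is a sketch of looking up the schema in effect as of a given instant; the file naming under `.hoodie/schema` assumed here is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical sketch of a schema-history lookup over .hoodie/schema; the
// "<commitTime>.schemacommit" naming is an assumption, not the actual layout.
public class SchemaHistoryExample {

    // Find the latest persisted schema whose commit time is <= the given instant.
    static Optional<Path> schemaAsOf(Path schemaDir, String instantTime) throws IOException {
        try (Stream<Path> files = Files.list(schemaDir)) {
            return files
                .filter(p -> p.getFileName().toString().endsWith(".schemacommit"))
                .filter(p -> commitTime(p).compareTo(instantTime) <= 0)
                .max((a, b) -> commitTime(a).compareTo(commitTime(b)));
        }
    }

    static String commitTime(Path p) {
        String name = p.getFileName().toString();
        return name.substring(0, name.indexOf('.'));
    }

    public static void main(String[] args) throws IOException {
        schemaAsOf(Path.of("/data/hudi_trips/.hoodie/schema"), "20220801120000")
            .ifPresent(p -> System.out.println("Schema in effect: " + p));
    }
}
```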
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
Review Comment:
Yes, this is for CDC support in Flink. If changelog mode is enabled in Flink,
this field takes one of the values in `HoodieOperation`, and it is used when
merging log files while reading a Hudi table in Flink
(`MergeOnReadInputFormat`). @danny0405 can add more details here.
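A minimal sketch of how changelog operations keyed by record key could be folded into a final state is shown below; the `Op` values mirror the idea of `HoodieOperation`, but this is not Flink's `MergeOnReadInputFormat` logic:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of changelog-style merging keyed by record key.
public class ChangelogMergeExample {

    enum Op { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }  // assumed operation kinds

    record Change(String recordKey, Op op, String payload) {}

    static Map<String, String> merge(List<Change> changes) {
        Map<String, String> state = new HashMap<>();
        for (Change c : changes) {
            switch (c.op()) {
                case INSERT, UPDATE_AFTER -> state.put(c.recordKey(), c.payload()); // latest image wins
                case UPDATE_BEFORE -> { /* pre-image; nothing to apply to the final state */ }
                case DELETE -> state.remove(c.recordKey());                          // tombstone
            }
        }
        return state;
    }

    public static void main(String[] args) {
        var merged = merge(List.of(
            new Change("trip-1", Op.INSERT, "fare=10"),
            new Change("trip-1", Op.UPDATE_BEFORE, "fare=10"),
            new Change("trip-1", Op.UPDATE_AFTER, "fare=12"),
            new Change("trip-2", Op.INSERT, "fare=7"),
            new Change("trip-2", Op.DELETE, null)));
        System.out.println(merged); // {trip-1=fare=12}
    }
}
```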
##########
content/assets/images/spec/spec_log_format_avro_block.excalidraw:
##########
@@ -0,0 +1,497 @@
+{
+ "type": "excalidraw",
Review Comment:
Why do we need these excalidraw files if the PNG is already there?
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
Review Comment:
```suggestion
At a high level, Hudi organizes data into a high level directory structure
under the base path (root directory for the Hudi table). The directory
structure is based on coarse-grained partitioning values set for the dataset.
Non-partitioned data sets store all the data files under the base path. Hudi
storage format has a special reserved *.hoodie* directory under the base path
that is used to store transaction logs and metadata.
```
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
Review Comment:
What kind of guarantees? Is it about ordering and duplication in
single-writer and multi-writer scenarios? I think that would be better placed
under a separate section on concurrency control.
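If such a concurrency control section is added, the optimistic check could be illustrated roughly as below; this is only a sketch of the idea (conflicts detected via overlapping file ids), not Hudi's implementation:

```java
import java.util.Set;

// Hypothetical illustration: two concurrent commits conflict if they touched
// the same file groups, in which case one writer must retry or abort.
public class ConflictCheckExample {

    record CommitPlan(String instantTime, Set<String> touchedFileIds) {}

    static boolean conflicts(CommitPlan ours, CommitPlan completedConcurrently) {
        return ours.touchedFileIds().stream()
            .anyMatch(completedConcurrently.touchedFileIds()::contains);
    }

    public static void main(String[] args) {
        var w1 = new CommitPlan("20220801120000", Set.of("file-A", "file-B"));
        var w2 = new CommitPlan("20220801120005", Set.of("file-B", "file-C"));
        System.out.println(conflicts(w1, w2)); // true
    }
}
```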
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
Review Comment:
```suggestion
- **Engine-neutral Data Processing** - designed to be neutral and not having
a preferred computation engine. Apache Hudi will manage metadata, provide
common abstractions and pluggable interfaces to most/all common computational
engines.
```
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-version Concurrency Control
(MVCC). Every transactional action on the Hudi table creates a new entry
(instant) in the timeline. All transactional actions follows the state
transition below
+
+* **requested** - Action is requested to start on the timeline
+* **inflight** - Action has started running and is currently in-flight
+* **completed** - Action has completed running
+
+All actions and the state transitions are registered with the timeline using
an atomic put of special meta-file inside the *.hoodie* directory. The
requirement from the underlying storage system is to support an atomic-put and
read-after-write consistency. The meta file structure is as follows
+
+```
+[Action timestamp].[Action type].[Action state]
+```
+
+* **Action timestamp** - Monotonically increasing value to denote strict
ordering of actions in the timeline. This could be provided by an external
token provider or rely on the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possition actions on
the hudi timeline.
+
+* | Action type | Description
|
+ | ------------- |
------------------------------------------------------------ |
+ | commit | Commit denotes an **atomic write (inserts, updates and
deletes)** of records in a table. A commit in Hudi is an atomic way of updating
data, metadata and indexes. The guarantee is that all or none the changes
within a commit will be visible to the readers |
+ | deltacommit | Special version of `commit` which is applicable only on a
Merge-on-Read storage engine. The writes are accumulated and batched to improve
write performance |
+ | rollback | Rollback denotes that the changes made by the
corresponding commit/delta commit was unsuccessful & hence rolled back,
removing any partial files produced during such a write |
+ | savepoint | Savepoint is a special marker to ensure a particular
commit is not automatically cleaned. It helps restore the table to a point on
the timeline, in case of disaster/data recovery scenarios |
+ | restore | Restore denotes that a particular Savepoint was restored
|
+ | clean | Maintenance activity that cleans up versions of data files
that no longer will be accessed |
+ | compaction | Maintenance to optimize the storage for query performance.
This action applies the batched up updates from `deltacommit` and re-optimizes
data files for query performance |
+ | replacecommit | Maintenance activity to cluster the data for better query
performance. This action is different from a `commit` in that the the table
state before and after are logically equivalent |
+ | indexing | [todo]
|
Review Comment:
Denotes asynchronous building of indexes (files, bloom filter, or column stats
index) without blocking ingestion.
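The meta-file naming convention quoted above lends itself to a small parsing example; treating a missing state suffix as "completed" is an assumption of this sketch:

```java
// Minimal sketch of parsing a timeline meta-file name of the form
// [Action timestamp].[Action type].[Action state].
public class InstantParseExample {

    record Instant(String timestamp, String action, String state) {}

    static Instant parse(String fileName) {
        String[] parts = fileName.split("\\.");
        String state = parts.length > 2 ? parts[2] : "completed"; // assumption of this sketch
        return new Instant(parts[0], parts[1], state);
    }

    public static void main(String[] args) {
        System.out.println(parse("20220801153000.deltacommit.requested"));
        // -> Instant[timestamp=20220801153000, action=deltacommit, state=requested]
        System.out.println(parse("20220801153000.commit.inflight"));
        System.out.println(parse("20220801160000.clean"));
    }
}
```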
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-version Concurrency Control
(MVCC). Every transactional action on the Hudi table creates a new entry
(instant) in the timeline. All transactional actions follows the state
transition below
+
+* **requested** - Action is requested to start on the timeline
Review Comment:
We can also add that this is the planning phase of any operation.
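An illustrative sketch of the requested -> inflight -> completed transitions (not Hudi's actual classes), with requested acting as the planning phase:

```java
// Hypothetical sketch of the instant state machine.
public class InstantStateExample {

    enum State { REQUESTED, INFLIGHT, COMPLETED }

    // Only forward transitions are legal; each transition is recorded on the
    // timeline with an atomic put of the corresponding meta-file.
    static State advance(State current) {
        return switch (current) {
            case REQUESTED -> State.INFLIGHT;   // planning done, execution starts
            case INFLIGHT  -> State.COMPLETED;  // execution finished, results visible
            case COMPLETED -> throw new IllegalStateException("already completed");
        };
    }

    public static void main(String[] args) {
        State s = State.REQUESTED;              // planning phase of the action
        s = advance(s);
        s = advance(s);
        System.out.println(s);                  // COMPLETED
    }
}
```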
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
Review Comment:
Yeah, so this meta field is populated in the write handles (create/append). It
can also uniquely identify a record and can be used to merge records, as it is
composed of the following: instant time, task partition id (from the task
context, depending on the engine), and an atomic counter. As such, it is
currently just used to stamp records with a unique CSN (akin to an LSN in some
databases like SQL Server). However, we are considering using this field to
guarantee uniqueness and drive merging once we relax the record key and
precombine field requirements.
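Following that description, here is a hedged sketch of how such a sequence number could be composed from the instant time, the task partition id, and an atomic counter; the exact delimiter and encoding below are assumptions:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of composing a per-record commit sequence number.
public class CommitSeqnoExample {

    private final String instantTime;
    private final int taskPartitionId;
    private final AtomicLong counter = new AtomicLong(0);

    CommitSeqnoExample(String instantTime, int taskPartitionId) {
        this.instantTime = instantTime;
        this.taskPartitionId = taskPartitionId;
    }

    // Each record written by this task gets a unique, monotonically increasing
    // sequence number within the commit (akin to an LSN/CSN).
    String nextSeqno() {
        return instantTime + "_" + taskPartitionId + "_" + counter.incrementAndGet();
    }

    public static void main(String[] args) {
        var gen = new CommitSeqnoExample("20220801120000", 3);
        System.out.println(gen.nextSeqno()); // 20220801120000_3_1
        System.out.println(gen.nextSeqno()); // 20220801120000_3_2
    }
}
```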
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-version Concurrency Control
(MVCC). Every transactional action on the Hudi table creates a new entry
(instant) in the timeline. All transactional actions follows the state
transition below
+
+* **requested** - Action is requested to start on the timeline
+* **inflight** - Action has started running and is currently in-flight
+* **completed** - Action has completed running
+
+All actions and the state transitions are registered with the timeline using
an atomic put of special meta-file inside the *.hoodie* directory. The
requirement from the underlying storage system is to support an atomic-put and
read-after-write consistency. The meta file structure is as follows
+
+```
+[Action timestamp].[Action type].[Action state]
+```
+
+* **Action timestamp** - Monotonically increasing value to denote strict
ordering of actions in the timeline. This could be provided by an external
token provider or rely on the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possition actions on
the hudi timeline.
+
+* | Action type | Description
|
+ | ------------- |
------------------------------------------------------------ |
+ | commit | Commit denotes an **atomic write (inserts, updates and
deletes)** of records in a table. A commit in Hudi is an atomic way of updating
data, metadata and indexes. The guarantee is that all or none the changes
within a commit will be visible to the readers |
+ | deltacommit | Special version of `commit` which is applicable only on a
Merge-on-Read storage engine. The writes are accumulated and batched to improve
write performance |
+ | rollback | Rollback denotes that the changes made by the
corresponding commit/delta commit was unsuccessful & hence rolled back,
removing any partial files produced during such a write |
+ | savepoint | Savepoint is a special marker to ensure a particular
commit is not automatically cleaned. It helps restore the table to a point on
the timeline, in case of disaster/data recovery scenarios |
+ | restore | Restore denotes that a particular Savepoint was restored
|
+ | clean | Maintenance activity that cleans up versions of data files
that no longer will be accessed |
+ | compaction | Maintenance to optimize the storage for query performance.
This action applies the batched up updates from `deltacommit` and re-optimizes
data files for query performance |
+ | replacecommit | Maintenance activity to cluster the data for better query
performance. This action is different from a `commit` in that the the table
state before and after are logically equivalent |
+ | indexing | [todo]
|
+ | schemacommit | [todo]
|
+
+ **Action state** - Denotes the state transition identifier (requested ->
inflight -> completed)
+
+
+
+Meta-files with completed transaction state contains details about the
transaction such as the number of inserts/updates/deletes per file ID, file
size, and some extra metadata such as checkpoint and schema for the batch of
records written. The data is written in Json and the Avro schema for each of
these actions are as follows
+
+- `commit ` -
[HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `deltacommit` -
[HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `rollback`-
[HoodieRollbackMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc)
+- `savepoint` -
[HoodieSavepointMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieSavePointMetadata.avsc)
+- `restore ` -
[HoodieRestoreMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc)
+- `clean` -
[HoodieCleanMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCleanMetadata.avsc)
+- `compaction` -
[HoodieCompactionMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc)
+- `replacecommit` -
[HoodieReplaceCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieReplaceCommitMetadata.avsc)
+- `indexing` -
[HoodieIndexCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieIndexCommitMetadata.avsc)
+- `schemacommit`-
Review Comment:
This is just `InternalSchema` serialized to a `byte[]`.
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/internal/schema/InternalSchema.java
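Since the completed meta-files carry JSON conforming to the Avro schemas listed above, here is a small sketch of inspecting one with Jackson; the file path and the `extraMetadata` access are illustrative assumptions:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: read a completed commit meta-file and inspect its JSON body.
public class CommitMetadataReadExample {
    public static void main(String[] args) throws Exception {
        Path metaFile = Path.of("/data/hudi_trips/.hoodie/20220801120000.commit");
        JsonNode root = new ObjectMapper().readTree(Files.readString(metaFile));

        // List the top-level fields of the commit metadata payload.
        root.fieldNames().forEachRemaining(System.out::println);

        // Checkpoints, schema, etc. may travel in an extra metadata map (if present).
        System.out.println(root.path("extraMetadata"));
    }
}
```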
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types with different trade-offs between ingest and query performance; the data files are stored differently based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into a key-value model. Every row is uniquely identified by a row key. To write a row into a Hudi dataset, each row must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory hierarchy within the table base path. This essentially provides a hierarchical isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of the row key, and the conflict resolution configuration manages strategies for disambiguating when multiple records with the same key are to be merged into the dataset. The resolution logic can be based on an ordering field or can be custom, specific to the dataset. To ensure consistent behavior when dealing with duplicate records, the resolution logic should be commutative and idempotent (a minimal resolution sketch follows this table) |
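+
+A minimal, hypothetical sketch of an ordering-field based resolution (higher ordering value wins) is shown below; the class name, the use of Avro `GenericRecord`, and the tie-breaking behavior are assumptions for illustration, not the actual Hudi payload API, which is defined by the configured payload class.
+
+```java
+// Hypothetical sketch of ordering-field based conflict resolution. The record
+// with the higher ordering value survives; ties keep the existing record, so a
+// deterministic tie-breaker would be needed for strict commutativity.
+import org.apache.avro.generic.GenericRecord;
+
+final class OrderingFieldResolver {
+  private final String orderingField;
+
+  OrderingFieldResolver(String orderingField) {
+    this.orderingField = orderingField;
+  }
+
+  /** Returns the record that should survive the merge of two versions of a key. */
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  GenericRecord resolve(GenericRecord existing, GenericRecord incoming) {
+    Comparable existingVal = (Comparable) existing.get(orderingField);
+    Comparable incomingVal = (Comparable) incoming.get(orderingField);
+    return incomingVal.compareTo(existingVal) > 0 ? incoming : existing;
+  }
+}
+```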
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields of each row merged with transactional metadata fields. These fields are encoded in the data-file format and are available in the table schema.
+
+[TODO - flesh this section out with details on guarantees and how to populate them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-version Concurrency Control (MVCC). Every transactional action on the Hudi table creates a new entry (instant) in the timeline. All transactional actions follow the state transition below
+
+* **requested** - Action is requested to start on the timeline
+* **inflight** - Action has started running and is currently in-flight
+* **completed** - Action has completed running
+
+All actions and their state transitions are registered with the timeline using an atomic put of a special meta-file inside the *.hoodie* directory. The requirement on the underlying storage system is to support atomic put and read-after-write consistency. The meta-file name structure is as follows
+
+```
+[Action timestamp].[Action type].[Action state]
+```
+
+* **Action timestamp** - Monotonically increasing value to denote strict
ordering of actions in the timeline. This could be provided by an external
token provider or rely on the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possible actions on the Hudi timeline.
+
+* | Action type | Description
|
+ | ------------- |
------------------------------------------------------------ |
+ | commit | Commit denotes an **atomic write (inserts, updates and deletes)** of records in a table. A commit in Hudi is an atomic way of updating data, metadata and indexes. The guarantee is that all or none of the changes within a commit will be visible to the readers |
+ | deltacommit | Special version of `commit` which is applicable only on a Merge-on-Read storage engine. The writes are accumulated and batched to improve write performance |
+ | rollback | Rollback denotes that the changes made by the corresponding commit/delta commit were unsuccessful and hence rolled back, removing any partial files produced during such a write |
+ | savepoint | Savepoint is a special marker to ensure a particular commit is not automatically cleaned. It helps restore the table to a point on the timeline, in case of disaster/data recovery scenarios |
+ | restore | Restore denotes that a particular Savepoint was restored |
+ | clean | Maintenance activity that cleans up versions of data files that will no longer be accessed |
+ | compaction | Maintenance activity to optimize the storage for query performance. This action applies the batched-up updates from `deltacommit` and re-optimizes the data files |
+ | replacecommit | Maintenance activity to cluster the data for better query performance. This action is different from a `commit` in that the table state before and after is logically equivalent |
+ | indexing | [todo]
|
+ | schemacommit | [todo]
|
+
+ **Action state** - Denotes the state transition identifier (requested ->
inflight -> completed)
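+
+As an illustration only, the following sketch parses an instant meta-file name of the form `[Action timestamp].[Action type].[Action state]` described above; the sample file name, the class name and the treatment of the completed state are assumptions made for the example and are not part of the specification.
+
+```java
+// Hypothetical sketch: split a timeline meta-file name into its components.
+final class InstantFileName {
+  final String timestamp;
+  final String action;
+  final String state;
+
+  private InstantFileName(String timestamp, String action, String state) {
+    this.timestamp = timestamp;
+    this.action = action;
+    this.state = state;
+  }
+
+  static InstantFileName parse(String fileName) {
+    String[] parts = fileName.split("\\.");
+    // Assumption for the example: a missing state suffix means the action completed.
+    String state = parts.length > 2 ? parts[2] : "completed";
+    return new InstantFileName(parts[0], parts[1], state);
+  }
+
+  public static void main(String[] args) {
+    InstantFileName instant = InstantFileName.parse("20220802123015.deltacommit.inflight");
+    System.out.println(instant.timestamp + " " + instant.action + " " + instant.state);
+  }
+}
+```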
+
+
+
+Meta-files with completed transaction state contain details about the transaction, such as the number of inserts/updates/deletes per file ID, file size, and some extra metadata such as the checkpoint and schema for the batch of records written. The data is written in JSON and the Avro schema for each of these actions is as follows
+
+- `commit ` -
[HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `deltacommit` -
[HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `rollback`-
[HoodieRollbackMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc)
+- `savepoint` -
[HoodieSavepointMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieSavePointMetadata.avsc)
+- `restore ` -
[HoodieRestoreMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc)
+- `clean` -
[HoodieCleanMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCleanMetadata.avsc)
+- `compaction` -
[HoodieCompactionMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc)
+- `replacecommit` -
[HoodieReplaceCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieReplaceCommitMetadata.avsc)
+- `indexing` -
[HoodieIndexCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieIndexCommitMetadata.avsc)
+- `schemacommit`-
+
+Reconciling all the actions in the timeline, the state of the Hudi dataset can
be re-created at any instant of time.
+
+
+
+## **Metadata**
+
+Hudi automatically extracts the physical data statistics and stores the
metadata along with the data to improve write and query performance. Hudi
Metadata is an internally-managed table which organizes the table metadata
under the base path *.hoodie/metadata.* The data format used is similar to the
merge-on-read data format. Every record stored in the metadata table is a Hudi
row and hence has partitioning key and row key specified. Following are the
metadata table partitions
+
+- **files** - Partition path to file name index. Key for the Hudi record is
the partition path and the actual record is a map of file name to an instance
of
[HoodieMetadataFileInfo](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L34)
. The files index can be used to do file listing and do filter based pruning
of the scanset during query
+
+- **bloom_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is `hash(partition name) + hash(file name)` and the actual payload is an instance of [HudiMetadataBloomFilter](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L66). The bloom filter helps identify the files to update and optimizes joins on the row key and point queries with a row key filter (see the key-construction sketch after this list).
+- **column_stats** - contains statistics of all the columns for all the rows in the table. This enables fine-grained file pruning for filters and join conditions in the query. The actual payload is an instance of [HoodieMetadataColumnStats](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L101).
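+
+A hypothetical sketch of constructing the bloom_filters index key as `hash(partition name) + hash(file name)` is shown below; the concrete hash function (MD5 here) and the hex encoding are assumptions for illustration, since only the high-level shape of the key is stated above.
+
+```java
+// Hypothetical sketch of a bloom_filters index key: hash(partition) + hash(file name).
+import java.nio.charset.StandardCharsets;
+import java.security.MessageDigest;
+import java.security.NoSuchAlgorithmException;
+
+final class BloomFilterIndexKey {
+  static String hashHex(String value) throws NoSuchAlgorithmException {
+    byte[] digest = MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8));
+    StringBuilder hex = new StringBuilder();
+    for (byte b : digest) {
+      hex.append(String.format("%02x", b));
+    }
+    return hex.toString();
+  }
+
+  static String keyFor(String partitionPath, String fileName) throws NoSuchAlgorithmException {
+    return hashHex(partitionPath) + hashHex(fileName);
+  }
+
+  public static void main(String[] args) throws NoSuchAlgorithmException {
+    System.out.println(keyFor("americas/brazil/sao_paulo", "some-file-id_1-0-1_20220802123015.parquet"));
+  }
+}
+```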
+
+
+
+## Data Organization
+
+As mentioned in the data model, data is partitioned coarsely through a directory hierarchy based on the configured partition path. Within each partition the data is physically stored as **base and log files** and organized into logical concepts as **File groups and File slices**. These logical concepts are referenced by the writer / reader expectations below.
+
+**File group** - Groups multiple versions of a base file. A file group is uniquely identified by a file id. Each version corresponds to the commit instant's timestamp recording updates to rows in the file. The base files are stored in open source data formats like Apache Parquet, Apache ORC, Apache HBase HFile etc.
+
+**File slice** - A file group can further be split into file slices: a base file corresponding to a commit timestamp and a set of log files that batch the deletes/updates to that base file. Each log file corresponds to a delta commit timestamp, and the delta commits happen after the base file commit in the Hudi timeline.
+
+
+
+### **Base file**
+
+The base file name format is:
+
+```
+[File Id]_[File Write Token]_[Transaction timestamp].[File Extension]
+```
+
+- **File Id** - Uniquely identifies a base file within a partition. Multiple versions of the base file share the same file id.
+- **File Write Token** - Monotonically increasing token for every attempt to write the base file. This helps uniquely identify the base file when there are failures and retries. The cleaner can clean up partial base files if the write token is not the latest in the file group.
+- **Commit timestamp** - Timestamp matching the commit instant in the timeline that created this base file.
+- **File Extension** - Base file extension to denote the open source file format, such as .parquet or .orc.
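+
+As a purely illustrative example (the sample file id, write token and timestamp are made up), a base file name can be composed from the fields above as follows.
+
+```java
+// Hypothetical sketch: compose a base file name following
+// [File Id]_[File Write Token]_[Transaction timestamp].[File Extension].
+final class BaseFileName {
+  static String of(String fileId, String writeToken, String commitTime, String extension) {
+    return fileId + "_" + writeToken + "_" + commitTime + "." + extension;
+  }
+
+  public static void main(String[] args) {
+    // Prints e.g. 54e9a5a2-ef4c-4e3a-8f4b-6f8d2c1f0b3a_1-0-1_20220802123015.parquet
+    System.out.println(of("54e9a5a2-ef4c-4e3a-8f4b-6f8d2c1f0b3a", "1-0-1", "20220802123015", "parquet"));
+  }
+}
+```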
+
+
+
+### Log File Format
+
+The log file name format is:
+
+```
+[File Id]_[Base Transaction timestamp].[Log File Extension].[Log File
Version]_[File Write Token]
+```
+
+- **File Id** - File Id of the base file in the slice
+- **Base Transaction timestamp** - Commit timestamp of the base file to which the log file applies the deletes/updates
+- **Log File Extension** - Extension defines the format used for the log file (e.g. Hudi proprietary log format)
+- **Log File Version** - Current version of the log file format
+- **File Write Token** - Monotonically increasing token for every attempt to write the log file. This helps uniquely identify the log file when there are failures and retries. The cleaner can clean up partial log files if the write token is not the latest in the file slice.
+
+The Log file format structure is a Hudi native format. The actual content bytes are serialized into one of the Apache Avro, Apache Parquet or Apache HFile file formats based on configuration, and the other metadata in the block is serialized using Java DataOutputStream (DOS).
+
+Hudi Log format specification is as follows.
+
+
+
+| Section | #Bytes | Description
|
+| ---------------------- | -------- |
------------------------------------------------------------ |
+| **magic** | 6 | 6 Characters '#HUDI#' stored as a byte
array. Sanity check for block corruption to assert start 6 bytes matches the
magic byte[]. |
+| **LogBlock length** | 8 | Length of the block excluding the magic.
|
+| **version** | 4 | Version of the Log file format,
monotonically increasing to support backwards compatibility |
+| **type** | 4 | Represents the type of the log block. Id
of the type is serialized as an Integer. |
+| **header length** | 8 | Length of the header section to follow
|
+| **header** | variable | Map of header metadata entries. The
entries are encoded with key as a metadata Id and the value is the String
representation of the metadata value. |
+| **content length** | 8 | Length of the actual content serialized
|
+| **content** | variable | The content contains the serialized
records in one of the supported file formats (Apache Avro, Apache Parquet or
Apache HFile) |
+| **footer length** | 8 | Length of the footer section to follow |
+| **footer** | variable | Similar to the header. Map of footer metadata entries. The entries are encoded with the key as a metadata Id and the value as the String representation of the metadata value. |
+| **total block length** | 8 | Total size of the block including the magic bytes. This is used to determine if a block is corrupt by comparing to the block size in the header. Each log block assumes that the block size will be the last data written in a block. Any data written after it is ignored. |
+
+Metadata key mapping from Integer to actual metadata is as follows
+
+1. Instant Time (encoding id: 1)
+2. Target Instant Time (encoding id: 2)
+3. Command Block Type (encoding id: 3)
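+
+A minimal, hypothetical sketch of laying out a log block with `java.io.DataOutputStream`, mirroring the section/byte-width table above, is shown below. The header/footer map encoding (entry count, then id and UTF string per entry) and the exact scope of the two length fields are assumptions for illustration; the table above only fixes the order and widths of the sections.
+
+```java
+// Hypothetical sketch of writing a log block in the order described above:
+// magic, block length, version, type, header, content, footer, total length.
+import java.io.ByteArrayOutputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Map;
+
+final class LogBlockWriterSketch {
+  static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.UTF_8); // 6 bytes
+
+  static byte[] writeBlock(int version, int blockTypeId, Map<Integer, String> header,
+                           byte[] content, Map<Integer, String> footer) throws IOException {
+    byte[] headerBytes = mapBytes(header);
+    byte[] footerBytes = mapBytes(footer);
+    long blockLength = 4 + 4                       // version + type
+        + 8 + headerBytes.length                   // header length + header
+        + 8 + content.length                       // content length + content
+        + 8 + footerBytes.length                   // footer length + footer
+        + 8;                                       // total block length field
+
+    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
+    DataOutputStream out = new DataOutputStream(bytes);
+    out.write(MAGIC);                              // magic (6 bytes)
+    out.writeLong(blockLength);                    // LogBlock length, excluding the magic (8)
+    out.writeInt(version);                         // version (4)
+    out.writeInt(blockTypeId);                     // type (4)
+    out.writeLong(headerBytes.length);             // header length (8)
+    out.write(headerBytes);                        // header (variable)
+    out.writeLong(content.length);                 // content length (8)
+    out.write(content);                            // content (variable)
+    out.writeLong(footerBytes.length);             // footer length (8)
+    out.write(footerBytes);                        // footer (variable)
+    out.writeLong(MAGIC.length + 8 + blockLength); // total block length including magic (8)
+    out.flush();
+    return bytes.toByteArray();
+  }
+
+  private static byte[] mapBytes(Map<Integer, String> entries) throws IOException {
+    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
+    DataOutputStream out = new DataOutputStream(bytes);
+    out.writeInt(entries.size());
+    for (Map.Entry<Integer, String> entry : entries.entrySet()) {
+      out.writeInt(entry.getKey());   // metadata id, e.g. 1 = Instant Time
+      out.writeUTF(entry.getValue()); // String representation of the metadata value
+    }
+    out.flush();
+    return bytes.toByteArray();
+  }
+}
+```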
+
+
+
+#### Log file format block types
+
+The following are the possible block types used in Hudi Log Format:
+
+##### Command Block (Id: 1)
+
+Encodes a command to the log reader. The Command block must be a 0-byte content block which only populates the Command Block Type metadata. The only possible value in the current version of the log format is ROLLBACK_PREVIOUS_BLOCK, which instructs the reader to undo the previous block written in the log file. This denotes that the previous action that wrote the log block was unsuccessful.
+
+##### Delete Block (Id: 2)
+
+
+
+| Section | #bytes | Description
|
+| -------------- | -------- |
------------------------------------------------------------ |
+| format version | 4 | version of the log file format
|
+| length | 8 | length of the deleted keys section to follow
|
+| deleted keys | variable | Tombstone of the row to encode a delete. The following 3 fields are serialized using the KryoSerializer. <br />**Row Key** - Unique row key within the partition to be deleted <br />**Partition Path** - Partition path of the deleted record <br />**Ordering Value** - In a particular batch of updates, the delete block is always written after the data (Avro/HFile/Parquet) block. This field preserves the ordering of deletes and inserts within the same batch. |
+
+##### Corrupted Block (Id: 3)
+
+This block type is never written to persistent storage. While reading a log
block, if the block is corrupted, then the reader gets an instance of the
Corrupted Block instead of a Data block.
+
+##### Avro Block (Id: 4)
+
+Data block serializes the actual records written into the log file
+
+
+
+| Section | #bytes | Description
|
+| -------------- | -------- |
------------------------------------------------------------ |
+| format version | 4 | version of the log file format
|
+| record count | 4 | total number of records in this block
|
+| record length | 8 | length of the record content to follow
|
+| record content | variable | Row represented as an Avro record serialized
using BinaryEncoder |
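+
+A hypothetical sketch of serializing one row as an Avro record with `BinaryEncoder`, as the record content section above describes, is shown below; the schema and field values are invented for the example.
+
+```java
+// Hypothetical sketch: encode a single row with Avro's BinaryEncoder.
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericDatumWriter;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.io.BinaryEncoder;
+import org.apache.avro.io.EncoderFactory;
+
+final class AvroRecordBytes {
+  static byte[] serialize(Schema schema, GenericRecord record) throws IOException {
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
+    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
+    encoder.flush();
+    return out.toByteArray();
+  }
+
+  public static void main(String[] args) throws IOException {
+    Schema schema = new Schema.Parser().parse(
+        "{\"type\":\"record\",\"name\":\"Trip\",\"fields\":["
+            + "{\"name\":\"_hoodie_record_key\",\"type\":\"string\"},"
+            + "{\"name\":\"fare\",\"type\":\"double\"}]}");
+    GenericRecord record = new GenericData.Record(schema);
+    record.put("_hoodie_record_key", "rider-42");
+    record.put("fare", 27.5);
+    System.out.println(serialize(schema, record).length + " bytes");
+  }
+}
+```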
+
+##### HFile Block (Id: 5)
+
+The HFile data block serializes the records using the HFile file format. The HFile data model is a key-value pair where both the key and the value are encoded as byte arrays. The Hudi row key is encoded as an Avro string, and the Avro record serialized using BinaryEncoder is stored as the value. The HFile format stores the rows in sorted order with an index to enable quick point reads and range scans.
+
+##### Parquet Block (Id: 6)
+
+The Parquet Block serializes the records using the Apache Parquet file format.
The serialization layout is similar to the Avro block except for the byte array
content encoded as columnar Parquet format. This log block type enables
efficient columnar scans and better compression.
+
+> Different data block types offer different tradeoffs; picking the right block type is based on the workload requirements and is critical for merge and read performance.
+
+
+
+## Reader Expectations
+
+Readers will use snapshot isolation to query a Hudi dataset at a consistent point in time in the Hudi timeline. The reader constructs the snapshot state using the following steps
+
+1. Pick an instant in the timeline (the last successful commit or a specific commit version explicitly queried) and set that as the commit time to compute the list of files to read from.
+2. For the picked commit time, compute all the file slices that belong to that specific commit time. For all the partition paths involved in the query, the file slices that belong to a successful commit before the picked commit should be included. The lookup on the filesystem could be slow and inefficient; it can be further optimized by caching in memory, using the files (partition path to filename) index, or with the support of an external timeline serving system.
+3. For the merge on read table type, ensure the appropriate merging rules are applied for the updates queued against the base file in the log files.
+ 1. [TODO] List merging rules - (ordering field).
Review Comment:
This really depends on the payload implementation. But we can talk about the default one, maybe `OverwriteWithLatestAvroPayload`?
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
Review Comment:
```suggestion
| Row key(s) | Record keys uniquely identify a
record/row within each partition if partitioning is enabled |
```
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
Review Comment:
Apart from the features listed below, do we also want to mention the design principles?
https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture#DesignAndArchitecture-DesignPrinciples
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types with different trade-offs between ingest and query performance; the data files are stored differently based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into a key-value model. Every row is uniquely identified by a row key. To write a row into a Hudi dataset, each row must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory hierarchy within the table base path. This essentially provides a hierarchical isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of the row key, and the conflict resolution configuration manages strategies for disambiguating when multiple records with the same key are to be merged into the dataset. The resolution logic can be based on an ordering field or can be custom, specific to the dataset. To ensure consistent behavior when dealing with duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields of each row merged with transactional metadata fields. These fields are encoded in the data-file format and are available in the table schema.
+
+[TODO - flesh this section out with details on guarantees and how to populate them]
+
+| Husi meta-fields | Description
|
+| ----------------------------- |
------------------------------------------------------------ |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates
an entry into the Transaction timeline. This entry is identified with the
commit time. This field matches to the commit time of the instant in the
timeline that created this record. More on this in Hudi transactions section
below. |
+| _hoodie_commit_seqno [string] | A unique sequencing number identifying the
position of the record within each commit [TODO - where do we use this?] |
+| _hoodie_record_key | Unique record key identifying the row within
the partition |
+| _hoodie_partition_path | Partition path under which the data is
organized into |
+| _hoodie_file_name | The data file name this record belongs to
|
+| _hoodie_is_deleted | Tombstone field to denote the record key is
deleted |
+| _hoodie_operation | [TODO] Is this a CDC Field? Not sure where
this is used. |
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-version Concurrency Control (MVCC). Every transactional action on the Hudi table creates a new entry (instant) in the timeline. All transactional actions follow the state transition below
+
+* **requested** - Action is requested to start on the timeline
+* **inflight** - Action has started running and is currently in-flight
+* **completed** - Action has completed running
+
+All actions and their state transitions are registered with the timeline using an atomic put of a special meta-file inside the *.hoodie* directory. The requirement on the underlying storage system is to support atomic put and read-after-write consistency. The meta-file name structure is as follows
+
+```
+[Action timestamp].[Action type].[Action state]
+```
+
+* **Action timestamp** - Monotonically increasing value to denote strict
ordering of actions in the timeline. This could be provided by an external
token provider or rely on the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possible actions on the Hudi timeline.
+
+* | Action type | Description
|
+ | ------------- |
------------------------------------------------------------ |
+ | commit | Commit denotes an **atomic write (inserts, updates and deletes)** of records in a table. A commit in Hudi is an atomic way of updating data, metadata and indexes. The guarantee is that all or none of the changes within a commit will be visible to the readers |
+ | deltacommit | Special version of `commit` which is applicable only on a Merge-on-Read storage engine. The writes are accumulated and batched to improve write performance |
+ | rollback | Rollback denotes that the changes made by the corresponding commit/delta commit were unsuccessful and hence rolled back, removing any partial files produced during such a write |
+ | savepoint | Savepoint is a special marker to ensure a particular commit is not automatically cleaned. It helps restore the table to a point on the timeline, in case of disaster/data recovery scenarios |
+ | restore | Restore denotes that a particular Savepoint was restored |
+ | clean | Maintenance activity that cleans up versions of data files that will no longer be accessed |
+ | compaction | Maintenance activity to optimize the storage for query performance. This action applies the batched-up updates from `deltacommit` and re-optimizes the data files |
+ | replacecommit | Maintenance activity to cluster the data for better query performance. This action is different from a `commit` in that the table state before and after is logically equivalent |
+ | indexing | [todo]
|
+ | schemacommit | [todo]
|
+
+ **Action state** - Denotes the state transition identifier (requested ->
inflight -> completed)
+
+
+
+Meta-files with completed transaction state contain details about the transaction, such as the number of inserts/updates/deletes per file ID, file size, and some extra metadata such as the checkpoint and schema for the batch of records written. The data is written in JSON and the Avro schema for each of these actions is as follows
+
+- `commit ` -
[HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `deltacommit` -
[HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `rollback`-
[HoodieRollbackMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc)
+- `savepoint` -
[HoodieSavepointMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieSavePointMetadata.avsc)
+- `restore ` -
[HoodieRestoreMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc)
+- `clean` -
[HoodieCleanMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCleanMetadata.avsc)
+- `compaction` -
[HoodieCompactionMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc)
+- `replacecommit` -
[HoodieReplaceCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieReplaceCommitMetadata.avsc)
+- `indexing` -
[HoodieIndexCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieIndexCommitMetadata.avsc)
+- `schemacommit`-
+
+Reconciling all the actions in the timeline, the state of the Hudi dataset can
be re-created at any instant of time.
+
+
+
+## **Metadata**
+
+Hudi automatically extracts the physical data statistics and stores the
metadata along with the data to improve write and query performance. Hudi
Metadata is an internally-managed table which organizes the table metadata
under the base path *.hoodie/metadata.* The data format used is similar to the
merge-on-read data format. Every record stored in the metadata table is a Hudi
row and hence has partitioning key and row key specified. Following are the
metadata table partitions
+
+- **files** - Partition path to file name index. Key for the Hudi record is
the partition path and the actual record is a map of file name to an instance
of
[HoodieMetadataFileInfo](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L34)
. The files index can be used to do file listing and do filter based pruning
of the scanset during query
+
+- **bloom_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is `hash(partition name) + hash(file name)` and the actual payload is an instance of [HudiMetadataBloomFilter](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L66). The bloom filter helps identify the files to update and optimizes joins on the row key and point queries with a row key filter.
+- **column_stats** - contains statistics of all the columns for all the rows in the table. This enables fine-grained file pruning for filters and join conditions in the query. The actual payload is an instance of [HoodieMetadataColumnStats](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L101).
+
+
+
+## Data Organization
+
+As mentioned in the data model, data is partitioned coarsely through a directory hierarchy based on the configured partition path. Within each partition the data is physically stored as **base and log files** and organized into logical concepts as **File groups and File slices**. These logical concepts are referenced by the writer / reader expectations below.
+
+**File group** - Groups multiple versions of a base file. A file group is uniquely identified by a file id. Each version corresponds to the commit instant's timestamp recording updates to rows in the file. The base files are stored in open source data formats like Apache Parquet, Apache ORC, Apache HBase HFile etc.
+
+**File slice** - A file group can further be split into file slices: a base file corresponding to a commit timestamp and a set of log files that batch the deletes/updates to that base file. Each log file corresponds to a delta commit timestamp, and the delta commits happen after the base file commit in the Hudi timeline.
+
+
+
+### **Base file**
+
+The base file name format is:
+
+```
+[File Id]_[File Write Token]_[Transaction timestamp].[File Extension]
+```
+
+- **File Id** - Uniquely identifies a base file within a partition. Multiple versions of the base file share the same file id.
+- **File Write Token** - Monotonically increasing token for every attempt to write the base file. This helps uniquely identify the base file when there are failures and retries. The cleaner can clean up partial base files if the write token is not the latest in the file group.
+- **Commit timestamp** - Timestamp matching the commit instant in the timeline that created this base file.
+- **File Extension** - Base file extension to denote the open source file format, such as .parquet or .orc.
+
+
+
+### Log File Format
+
+The log file name format is:
+
+```
+[File Id]_[Base Transaction timestamp].[Log File Extension].[Log File
Version]_[File Write Token]
+```
+
+- **File Id** - File Id of the base file in the slice
+- **Base Transaction timestamp** - Commit timestamp of the base file to which the log file applies the deletes/updates
+- **Log File Extension** - Extension defines the format used for the log file (e.g. Hudi proprietary log format)
+- **Log File Version** - Current version of the log file format
+- **File Write Token** - Monotonically increasing token for every attempt to write the log file. This helps uniquely identify the log file when there are failures and retries. The cleaner can clean up partial log files if the write token is not the latest in the file slice.
+
+The Log file format structure is a Hudi native format. The actual content bytes are serialized into one of the Apache Avro, Apache Parquet or Apache HFile file formats based on configuration, and the other metadata in the block is serialized using Java DataOutputStream (DOS).
+
+Hudi Log format specification is as follows.
+
+
+
+| Section | #Bytes | Description
|
+| ---------------------- | -------- |
------------------------------------------------------------ |
+| **magic** | 6 | 6 Characters '#HUDI#' stored as a byte
array. Sanity check for block corruption to assert start 6 bytes matches the
magic byte[]. |
+| **LogBlock length** | 8 | Length of the block excluding the magic.
|
+| **version** | 4 | Version of the Log file format,
monotonically increasing to support backwards compatibility |
+| **type** | 4 | Represents the type of the log block. Id
of the type is serialized as an Integer. |
+| **header length** | 8 | Length of the header section to follow
|
+| **header** | variable | Map of header metadata entries. The
entries are encoded with key as a metadata Id and the value is the String
representation of the metadata value. |
+| **content length** | 8 | Length of the actual content serialized
|
+| **content** | variable | The content contains the serialized
records in one of the supported file formats (Apache Avro, Apache Parquet or
Apache HFile) |
+| **footer length** | 8 | Length of the footer section to follow |
+| **footer** | variable | Similar to the header. Map of footer metadata entries. The entries are encoded with the key as a metadata Id and the value as the String representation of the metadata value. |
+| **total block length** | 8 | Total size of the block including the magic bytes. This is used to determine if a block is corrupt by comparing to the block size in the header. Each log block assumes that the block size will be the last data written in a block. Any data written after it is ignored. |
+
+Metadata key mapping from Integer to actual metadata is as follows
+
+1. Instant Time (encoding id: 1)
+2. Target Instant Time (encoding id: 2)
+3. Command Block Type (encoding id: 3)
+
+
+
+#### Log file format block types
+
+The following are the possible block types used in Hudi Log Format:
+
+##### Command Block (Id: 1)
+
+Encodes a command to the log reader. The Command block must be a 0-byte content block which only populates the Command Block Type metadata. The only possible value in the current version of the log format is ROLLBACK_PREVIOUS_BLOCK, which instructs the reader to undo the previous block written in the log file. This denotes that the previous action that wrote the log block was unsuccessful.
+
+##### Delete Block (Id: 2)
+
+
+
+| Section | #bytes | Description
|
+| -------------- | -------- |
------------------------------------------------------------ |
+| format version | 4 | version of the log file format
|
+| length | 8 | length of the deleted keys section to follow
|
+| deleted keys | variable | Tombstone of the row to encode a delete. The following 3 fields are serialized using the KryoSerializer. <br />**Row Key** - Unique row key within the partition to be deleted <br />**Partition Path** - Partition path of the deleted record <br />**Ordering Value** - In a particular batch of updates, the delete block is always written after the data (Avro/HFile/Parquet) block. This field preserves the ordering of deletes and inserts within the same batch. |
+
+##### Corrupted Block (Id: 3)
+
+This block type is never written to persistent storage. While reading a log
block, if the block is corrupted, then the reader gets an instance of the
Corrupted Block instead of a Data block.
+
+##### Avro Block (Id: 4)
+
+Data block serializes the actual records written into the log file
+
+
+
+| Section | #bytes | Description
|
+| -------------- | -------- |
------------------------------------------------------------ |
+| format version | 4 | version of the log file format
|
+| record count | 4 | total number of records in this block
|
+| record length | 8 | length of the record content to follow
|
+| record content | variable | Row represented as an Avro record serialized
using BinaryEncoder |
+
+##### HFile Block (Id: 5)
+
+The HFile data block serializes the records using the HFile file format. The HFile data model is a key-value pair where both the key and the value are encoded as byte arrays. The Hudi row key is encoded as an Avro string, and the Avro record serialized using BinaryEncoder is stored as the value. The HFile format stores the rows in sorted order with an index to enable quick point reads and range scans.
+
+##### Parquet Block (Id: 6)
+
+The Parquet Block serializes the records using the Apache Parquet file format.
The serialization layout is similar to the Avro block except for the byte array
content encoded as columnar Parquet format. This log block type enables
efficient columnar scans and better compression.
+
+> Different data block types offer different tradeoffs; picking the right block type is based on the workload requirements and is critical for merge and read performance.
+
+
+
+## Reader Expectations
+
+Readers will use snapshot isolation to query a Hudi dataset at a consistent point in time in the Hudi timeline. The reader constructs the snapshot state using the following steps
+
+1. Pick an instant in the timeline (the last successful commit or a specific commit version explicitly queried) and set that as the commit time to compute the list of files to read from.
+2. For the picked commit time, compute all the file slices that belong to that specific commit time. For all the partition paths involved in the query, the file slices that belong to a successful commit before the picked commit should be included. The lookup on the filesystem could be slow and inefficient; it can be further optimized by caching in memory, using the files (partition path to filename) index, or with the support of an external timeline serving system (a minimal file-slice selection sketch follows this list).
+3. For the merge on read table type, ensure the appropriate merging rules are applied for the updates queued against the base file in the log files.
+ 1. [TODO] List merging rules - (ordering field).
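+
+A minimal sketch of the file-slice selection in step 2 is shown below; `FileSlice` and its fields are illustrative stand-ins, not Hudi's actual file-system view API, and the sketch assumes zero-padded timestamp strings so that lexicographic comparison matches timeline order.
+
+```java
+// Hypothetical sketch: keep the latest file slice per file group whose commit
+// time is not after the picked instant.
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+final class SnapshotView {
+  static final class FileSlice {
+    final String fileId;
+    final String commitTime; // instant that created the slice's base file
+
+    FileSlice(String fileId, String commitTime) {
+      this.fileId = fileId;
+      this.commitTime = commitTime;
+    }
+  }
+
+  /** Latest visible slice per file group as of the picked commit time. */
+  static Map<String, FileSlice> slicesAsOf(List<FileSlice> allSlices, String pickedCommitTime) {
+    Map<String, FileSlice> latest = new HashMap<>();
+    for (FileSlice slice : allSlices) {
+      if (slice.commitTime.compareTo(pickedCommitTime) <= 0) {
+        FileSlice current = latest.get(slice.fileId);
+        if (current == null || slice.commitTime.compareTo(current.commitTime) > 0) {
+          latest.put(slice.fileId, slice);
+        }
+      }
+    }
+    return latest;
+  }
+}
+```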
+
+
+
+## Writer Expectations
+
+Writers into Hudi will have to ingest new records, updates to existing records or deletes of records in the dataset. All transactional actions follow the same state transition as described in the transaction log (timeline) section. Writers will optimistically create new base and log files and will finally transition the action state to completed to register all the modifications to the dataset atomically. The writer merges the data using the following steps
+
+1. Writer will pick a monotonically increasing instant time from the latest
state of the Hudi timeline (**action commit time**) and will pick the last
successful commit instant (**merge commit time**) to merge the changes to. If
the merge succeeds, then action commit time will be the next successful commit
in the timeline.
+2. For all the incoming records, the writer will have to efficiently determine if each is an update or an insert. This is done by a process called tagging - a batched point lookup of the row key and partition path pairs across the entire dataset. The efficiency of tagging is critical to the merge performance. This can be optimized with indexes (bloom, global key-value based index) and caching. New records will not have a tag (a minimal tagging sketch follows this list).
+3. Once records are tagged, the writer can apply them onto the specific file
slice.
+ 1. For copy on write, writer will create a new slice (action commit time)
of the base file in the file group
+ 2. For merge on read, writer will create a new log file with the action
commit time on the merge commit time file slice
+4. Deletes are encoded as special form of updates where only the meta fields
and the operation is populated. See the delete block type in log format block
types.
+5. Once all the writes to the file system are complete, concurrency control checks happen to ensure there are no overlapping writes; if that succeeds, the commit action is completed in the timeline atomically, making the merged changes visible to the next reader.
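+
+A minimal sketch of the tagging step (step 2) is shown below; the `IncomingRecord` type and the in-memory index map are illustrative assumptions standing in for whatever index (bloom filter, global key-value index) the writer actually consults.
+
+```java
+// Hypothetical sketch: tag each incoming record with the file id that already
+// holds its (partition path, row key), or leave it untagged to mark an insert.
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+final class Tagger {
+  static final class IncomingRecord {
+    final String partitionPath;
+    final String rowKey;
+    String taggedFileId; // null means no tag, i.e. a new insert
+
+    IncomingRecord(String partitionPath, String rowKey) {
+      this.partitionPath = partitionPath;
+      this.rowKey = rowKey;
+    }
+  }
+
+  /** The index maps "partitionPath:rowKey" to the file id currently holding that row. */
+  static List<IncomingRecord> tag(List<IncomingRecord> incoming, Map<String, String> index) {
+    List<IncomingRecord> tagged = new ArrayList<>(incoming.size());
+    for (IncomingRecord record : incoming) {
+      record.taggedFileId = index.get(record.partitionPath + ":" + record.rowKey);
+      tagged.add(record);
+    }
+    return tagged;
+  }
+}
+```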
+
+
+
+## Table maintenance
+
+[TODO - flesh out what needs to be done for each of these]
Review Comment:
We can pick certain sections from Hudi docs, e.g. for clustering
https://hudi.apache.org/docs/clustering#running-clustering
Let's discuss how much detail we want here.
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms
immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collection
of files/objects
+
+- streaming primitives like incremental merges, change stream etc
+- database primitives like tables, transactions, mutability, indexes and query
performance optimizations
+
+Apache Hudi is an open source data lake platform that is built on top of the
Hudi Storage Format and it unlocks the following features
+
+- **Unified Computation model** - an unified way to combine large batch style
operations and frequent near real time streaming operations over a single
unified dataset
+- **Self Optimized Storage** - Automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and
non-blocking to actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations
+- **Data processing engine neutral** - designed to be neutral and not having a
preferred computation engine. Apache Hudi will manage metadata, provide common
abstractions and pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into optional high level directory
structure under the base path (root directory for the Hudi table). The
directory structure is based on coarse grained partitioning values set for the
dataset. Non partitioned data sets store all the data files under the base
path. Hudi storage format has a special reserved *.hoodie* directory under the
base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+Hudi storage format offers two table types offering different trade-offs
between ingest and query performance and the data files are stored differently
based on the chosen table type.
+
+| Table Type | Tradeoff
|
+| ------------- | ------------------------------------------------------------
|
+| Copy on Write | Optimized for read performance and ideal for slow changing
datasets |
+| Merge-on-read | Optimized to balance the write and read performance and
ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into key-value model. Every row is
uniquely identified with a row key. To write a row into Hudi dataset, each row
must specify the following user fields
+
+| User fields | Description
|
+| --------------------------- |
------------------------------------------------------------ |
+| Partitioning key [Optional] | Value of this field defines the directory
hierarchy within the table base path. This essentially provides an hierarchy
isolation for managing data and related metadata |
+| Row key | Record keys uniquely identify a record/row
within each partition if partitioning is enabled |
+| Ordering field(s) | Hudi guarantees the uniqueness constraint of
row key and the conflict resolution configuration manages strategies on how to
disambiguate when multiple records with the same keys are to be merged into the
dataset. The resolution logic can be based on an ordering field or can be
custom, specific to the dataset. To ensure consistent behavior dealing with
duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+Hudi format stores the user fields along with the row merged along with
transactional metadata fields. These fields are encoded in the data-file format
and available in the table schema.
+
+[TODO - flush this section out with details on guarantees and how to populate
them]
+
+| Husi meta-fields | Description
|
Review Comment:
```suggestion
| Hudi meta-fields | Description
|
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]