prasannarajaperumal commented on code in PR #6268:
URL: https://github.com/apache/hudi/pull/6268#discussion_r942163446
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,371 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format, which transforms immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+The Hudi Storage Format enables the following features over very large collections of files/objects:
+
+- streaming primitives such as incremental merges, change streams, etc.
+- database primitives such as tables, transactions, mutability, indexes and query performance optimizations
+
+Apache Hudi is an open source data lake platform built on top of the Hudi Storage Format, and it unlocks the following features:
+
+- **Unified computation model** - a unified way to combine large batch-style operations and frequent near-real-time streaming operations over a single unified dataset
+- **Self-optimized storage** - automatically handles all table storage maintenance such as compaction, clustering and vacuuming, asynchronously and without blocking actual data changes
+- **Cloud-native database** - abstracts the table/schema from the actual storage and keeps metadata and indexes up to date, unlocking multi-fold read and write performance optimizations
+- **Engine neutrality** - designed to be neutral, with no preferred computation engine. Apache Hudi manages metadata and provides common abstractions and pluggable interfaces to most/all common computation engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into a directory structure under the base path (the root directory of the Hudi table). The directory structure is based on coarse-grained partitioning values set for the dataset; non-partitioned datasets store all data files directly under the base path. The Hudi storage format reserves a special *.hoodie* directory under the base path that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/ <== BASE PATH
+├── .hoodie/ <== META BASE PATH
+│ └── metadata/
+├── americas/
+│ ├── brazil/
+│ │ └── sao_paulo/ <== PARTITIONED DIRECTORY
+│ │ ├── <data_files>
+│ └── united_states/
+│ └── san_francisco/
+│ ├── <data_files>
+└── asia/
+ └── india/
+ └── chennai/
+ ├── <data_files>
+```
+
+The Hudi storage format offers two table types with different trade-offs between ingest and query performance; data files are stored differently depending on the chosen table type.
+
+| Table Type    | Trade-off |
+| ------------- | --------- |
+| Copy-on-Write | Optimized for read performance and ideal for slowly changing datasets |
+| Merge-on-Read | Optimized to balance write and read performance and ideal for frequently changing datasets |
+
+
+
+### Data Model
+
+Within each partition, data is organized into a key-value model. Every row is uniquely identified by a row key. To write a row into a Hudi dataset, each row must specify the following user fields:
+
+| User fields                 | Description |
+| --------------------------- | ----------- |
+| Partitioning key [Optional] | The value of this field defines the directory hierarchy within the table base path. This essentially provides a hierarchical isolation for managing data and related metadata |
+| Row key(s)                  | Record keys uniquely identify a record/row within each partition if partitioning is enabled |
+| Ordering field(s)           | Hudi guarantees the uniqueness constraint of the row key, and the conflict resolution configuration manages strategies for disambiguating when multiple records with the same keys are to be merged into the dataset. The resolution logic can be based on an ordering field or can be custom, specific to the dataset. To ensure consistent behaviour when dealing with duplicate records, the resolution logic should be commutative and idempotent |
+
+**Hudi metadata fields**
+
+The Hudi format stores the user fields of each row merged with transactional metadata fields. These fields are encoded in the data-file format and are available in the table schema.
+
+| Hudi meta-fields             | Description |
+| ---------------------------- | ----------- |
+| _hoodie_commit_time [string] | Every modification to a Hudi dataset creates an entry in the transaction timeline. This entry is identified by the commit time. This field matches the commit time of the instant in the timeline that created this record. More on how this is populated in the Hudi transactions section below. |
+| _hoodie_record_key           | Unique record key identifying the row within the partition |
+| _hoodie_partition_path       | Partition path under which the data is organized |
+| _hoodie_file_name            | Name of the data file this record belongs to |
+| _hoodie_is_deleted           | Tombstone field to denote that the record key is deleted |
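As an illustration, a stored row combines the meta-fields above with the user fields. The sketch below is hypothetical (every field value is invented; this is not output from Hudi):

```python
# Hypothetical example of a stored row after Hudi's meta-fields are merged
# with the user fields; all values below are invented for illustration.
record = {
    "_hoodie_commit_time": "20220809103015",
    "_hoodie_record_key": "trip_4821",
    "_hoodie_partition_path": "americas/brazil/sao_paulo",
    "_hoodie_file_name": "fg-0001_1-0-1_20220809103015.parquet",
    "_hoodie_is_deleted": False,
    # user fields follow the meta-fields in the same schema
    "trip_id": "trip_4821",
    "fare": 27.50,
}

assert record["_hoodie_partition_path"] == "americas/brazil/sao_paulo"
```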
+
+
+
+## Transaction Log (Timeline)
+
+Data consistency in Hudi is provided using Multi-Version Concurrency Control (MVCC). Every transactional action on the Hudi table creates a new entry (instant) in the timeline. All transactional actions follow the state transitions below:
+
+* **requested** - Action is requested to start on the timeline.
+* **inflight** - Action has started running and is currently in-flight
+* **completed** - Action has completed running
+
+All actions and their state transitions are registered with the timeline using an atomic put of a special meta-file inside the *.hoodie* directory. The underlying storage system is required to support atomic puts and read-after-write consistency. The meta-file name structure is as follows:
+
+```
+[Action timestamp].[Action type].[Action state]
+```
+
+* **Action timestamp** - Monotonically increasing value to denote strict ordering of actions in the timeline. This could be provided by an external token provider or be derived from the system epoch time at millisecond granularity.
+
+* **Action type** - Type of action. The following are the possible actions on the Hudi timeline.
+
+* | Action type   | Description |
  | ------------- | ----------- |
  | commit        | Commit denotes an **atomic write (inserts, updates and deletes)** of records in a table. A commit in Hudi is an atomic way of updating data, metadata and indexes. The guarantee is that all or none of the changes within a commit will be visible to readers |
  | deltacommit   | Special version of `commit` which is applicable only to a Merge-on-Read storage engine. The writes are accumulated and batched to improve write performance |
  | rollback      | Rollback denotes that the changes made by the corresponding commit/delta commit were unsuccessful and hence rolled back, removing any partial files produced during such a write |
  | savepoint     | Savepoint is a special marker to ensure a particular commit is not automatically cleaned. It helps restore the table to a point on the timeline, in case of disaster/data recovery scenarios |
  | restore       | Restore denotes that a particular savepoint was restored |
  | clean         | Maintenance activity that cleans up versions of data files that will no longer be accessed |
  | compaction    | Maintenance activity that optimizes the storage for query performance. This action applies the batched-up updates from `deltacommit` and re-optimizes data files for query performance |
  | replacecommit | Maintenance activity that clusters the data for better query performance. This action is different from a `commit` in that the table state before and after are logically equivalent |
  | indexing      | Maintenance activity that updates the index with the data. This action does not change data; it only updates the index asynchronously to data changes |
+
+* **Action state** - Denotes the state transition identifier (requested -> inflight -> completed)
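The meta-file naming scheme can be made concrete with a short sketch. This is not Hudi code: `parse_instant` is a hypothetical helper name, and the sketch assumes all three components appear in the file name exactly as described above.

```python
VALID_STATES = {"requested", "inflight", "completed"}

def parse_instant(file_name: str) -> tuple[str, str, str]:
    """Split a timeline meta-file name of the form
    [Action timestamp].[Action type].[Action state]."""
    timestamp, action_type, state = file_name.split(".")
    if state not in VALID_STATES:
        raise ValueError(f"unknown action state: {state}")
    return timestamp, action_type, state

# A commit that is currently running:
assert parse_instant("20220809103015.commit.inflight") == (
    "20220809103015", "commit", "inflight",
)
```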
+
+Meta-files in the requested transaction state contain any planning details. If an action requires generating a plan of execution, this is done before requesting, and the plan is persisted in the meta-file. The data is written in JSON, and the Avro schema for each of the requested actions is as follows:
+
+* `replacecommit` - [HoodieRequestedReplaceMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRequestedReplaceMetadata.avsc)
+* `restore` - [HoodieRestorePlan](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRestorePlan.avsc)
+* `rollback` - [HoodieRollbackPlan](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRollbackPlan.avsc)
+* `clean` - [HoodieCleanerPlan](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCleanerPlan.avsc)
+* `indexing` - [HoodieIndexPlan](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieIndexPlan.avsc)
+
+Meta-files in the completed transaction state contain details about the completed transaction, such as the number of inserts/updates/deletes per file ID, the file size, and some extra metadata such as the checkpoint and schema for the batch of records written. The data is written in JSON, and the Avro schema for each of these actions is as follows:
+
+- `commit` - [HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `deltacommit` - [HoodieCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCommitMetadata.avsc)
+- `rollback` - [HoodieRollbackMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc)
+- `savepoint` - [HoodieSavepointMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieSavePointMetadata.avsc)
+- `restore` - [HoodieRestoreMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc)
+- `clean` - [HoodieCleanMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCleanMetadata.avsc)
+- `compaction` - [HoodieCompactionMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc)
+- `replacecommit` - [HoodieReplaceCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieReplaceCommitMetadata.avsc)
+- `indexing` - [HoodieIndexCommitMetadata](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieIndexCommitMetadata.avsc)
+
+By reconciling all the actions in the timeline, the state of the Hudi dataset can be re-created as of any instant in time.
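As a sketch of that reconciliation (not Hudi's implementation; the function name and timestamp format are illustrative), the set of completed actions as of a given instant can be derived from the meta-file names alone:

```python
# Hypothetical sketch: derive the completed instants as of a given time
# purely from timeline meta-file names of the form
# [Action timestamp].[Action type].[Action state].
# Assumes fixed-width timestamps so string comparison matches time order.
def completed_instants(meta_files: list[str], as_of: str) -> list[tuple[str, str]]:
    instants = []
    for name in meta_files:
        timestamp, action_type, state = name.split(".")
        if state == "completed" and timestamp <= as_of:
            instants.append((timestamp, action_type))
    return sorted(instants)

timeline = [
    "003.compaction.inflight",   # not yet visible to readers
    "001.commit.completed",
    "002.deltacommit.completed",
]
assert completed_instants(timeline, "002") == [
    ("001", "commit"),
    ("002", "deltacommit"),
]
```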
+
+
+
+## Metadata
+
+Hudi automatically extracts physical data statistics and stores the metadata along with the data to improve write and query performance. The Hudi metadata is an internally-managed table which organizes the table metadata under the path *.hoodie/metadata* within the base path. The data format used is similar to the merge-on-read data format. Every record stored in the metadata table is a Hudi row and hence has a partitioning key and row key specified. The following are the metadata table partitions:
+
+- **files** - Partition path to file name index. The key for the Hudi record is the partition path, and the actual record is a map of file name to an instance of [HoodieMetadataFileInfo](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L34). The files index can be used for file listing and for filter-based pruning of the scan set during a query
+
+- **bloom_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is `hash(partition name) + hash(file name)`, and the actual payload is an instance of [HudiMetadataBloomFilter](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L66). The bloom filter helps identify the files to update, and optimizes joins on the row key and point queries with a row key filter.
+- **column_stats** - Contains statistics of all the columns for all the rows in the table. This enables fine-grained file pruning for filters and join conditions in the query. The actual payload is an instance of [HoodieMetadataColumnStats](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc#L101).
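The bloom_filters key scheme can be sketched as follows. This is illustrative only: the spec above does not fix a concrete hash function, so the MD5/hex encoding used here is an assumption, not necessarily Hudi's actual choice.

```python
import hashlib

# Illustrative sketch of the bloom_filters record key described above.
# The spec only says hash(partition name) + hash(file name); the concrete
# hash function (MD5, hex-encoded) chosen here is an assumption.
def bloom_filter_key(partition_name: str, file_name: str) -> str:
    digest = lambda s: hashlib.md5(s.encode("utf-8")).hexdigest()
    return digest(partition_name) + digest(file_name)

key = bloom_filter_key("americas/brazil/sao_paulo",
                       "fg-0001_1-0-1_20220809103015.parquet")
assert len(key) == 64  # two 32-character hex digests concatenated
```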
+
+
+
+## Data Organization
+
+As mentioned in the data model, data is partitioned coarsely through a directory hierarchy based on the configured partition path. Within each partition, the data is physically stored as **base and log files** and organized into the logical concepts of **file groups and file slices**. These logical concepts will be referred to when describing the writer/reader requirements.
+
+**File group** - Groups multiple versions of a base file. A file group is uniquely identified by a file id. Each version corresponds to the timestamp of the commit instant recording updates to rows in the file. Base files are stored in open source data formats such as Apache Parquet, Apache ORC, Apache HBase HFile, etc.
+
+**File slice** - A file group can further be split into multiple file slices: a base file corresponding to a commit timestamp, plus a set of log files that batch the deletes/updates to that base file. Each log file corresponds to a delta commit timestamp, and the delta commits happen after the base file's commit in the Hudi timeline.
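The relationship between file groups, base files and log files can be sketched as a small grouping function. The names and tuple structures here are hypothetical, chosen only to illustrate the concept, not Hudi's implementation:

```python
# Hypothetical sketch (not Hudi's implementation) of grouping a partition's
# physical files into file slices: each base file starts a slice, and each
# log file attaches to the slice of its file group and base commit.
def build_slices(base_files, log_files):
    """base_files: [(file_id, commit_ts)]
    log_files: [(file_id, base_commit_ts, delta_commit_ts)]"""
    slices = {(fid, ts): [] for fid, ts in base_files}
    for fid, base_ts, delta_ts in log_files:
        slices[(fid, base_ts)].append(delta_ts)
    # Delta commits happen after the base commit, so order by timestamp.
    return {key: sorted(deltas) for key, deltas in slices.items()}

slices = build_slices(
    base_files=[("fg-0001", "001")],
    log_files=[("fg-0001", "001", "003"), ("fg-0001", "001", "002")],
)
assert slices[("fg-0001", "001")] == ["002", "003"]
```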
+
+
+
+### Base file
+
+The base file name format is:
+
+```
+[File Id]_[File Write Token]_[Transaction timestamp].[File Extension]
+```
+
+- **File Id** - Uniquely identifies a base file within a partition. Multiple versions of the base file share the same file id.
+- **File Write Token** - Monotonically increasing token for every attempt to write the base file. This helps uniquely identify the base file when there are failures and retries. The cleaner can clean up partial base files if the write token is not the latest in the file group
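A short sketch of splitting this name into its components (not Hudi code; it assumes the file id and write token themselves contain no underscores):

```python
# Minimal sketch (not Hudi code) that splits the base file name format
# [File Id]_[File Write Token]_[Transaction timestamp].[File Extension]
# into its components; assumes the file id and write token contain no
# underscores beyond the two separators shown in the format.
def parse_base_file(name: str) -> dict:
    stem, _, extension = name.rpartition(".")
    file_id, write_token, timestamp = stem.split("_")
    return {
        "file_id": file_id,
        "write_token": write_token,
        "timestamp": timestamp,
        "extension": extension,
    }

info = parse_base_file("fg-0001_1-0-1_20220809103015.parquet")
assert info["file_id"] == "fg-0001"
assert info["timestamp"] == "20220809103015"
```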
Review Comment:
We just describe what is the format of the base file naming here. I think
since we specify that this field is used for retries - it is clear that this is
useful within the same engine context?