prasannarajaperumal commented on code in PR #6268:
URL: https://github.com/apache/hudi/pull/6268#discussion_r939979536
##########
website/src/pages/tech-specs.md:
##########
@@ -0,0 +1,350 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format, which transforms immutable cloud/file storage systems into transactional data lakes.
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collections of files/objects:
+
+- streaming primitives such as incremental merges, change streams, etc.
+- database primitives such as tables, transactions, mutability, indexes, and query performance optimizations
+
+Apache Hudi is an open source data lake platform built on top of the Hudi Storage Format that unlocks the following features:
+
+- **Unified Computation Model** - a unified way to combine large batch-style operations and frequent, near-real-time streaming operations over a single dataset
+- **Self-Optimizing Storage** - automatically handles table storage maintenance such as compaction, clustering, and vacuuming, asynchronously and without blocking actual data changes
+- **Cloud-Native Database** - abstracts the table/schema from the underlying storage and keeps metadata and indexes up to date, unlocking multi-fold read and write performance optimizations
+- **Data Processing Engine Neutral** - designed to be neutral, with no preferred computation engine. Apache Hudi manages metadata and provides common abstractions and pluggable interfaces for most common compute engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into an optional high-level directory structure under the base path (the root directory of the Hudi table). The directory structure is based on coarse-grained partitioning values set for the dataset. Non-partitioned datasets store all data files directly under the base path. The Hudi storage format reserves a special *.hoodie* directory under the base path for transaction logs and metadata.
+
+```
+/data/hudi_trips/
+├── .hoodie/
+│   └── metadata/
+├── americas/
+│   ├── brazil/
+│   │   └── sao_paulo/
+│   │       └── <data_files>
+│   └── united_states/
+│       └── san_francisco/
+│           └── <data_files>
+└── asia/
+    └── india/
+        └── chennai/
+            └── <data_files>
+```
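To make the layout concrete, here is a minimal, illustrative Python sketch (not part of the spec; the helper name `partition_path` is hypothetical) that joins coarse-grained partition values into a partition directory under the table base path:

```python
import os


def partition_path(base_path: str, partition_values: list[str]) -> str:
    """Join coarse-grained partition values into a directory path
    under the table base path. Non-partitioned tables pass []."""
    return os.path.join(base_path, *partition_values)


# Partitioned table: data files land under nested partition directories.
print(partition_path("/data/hudi_trips", ["americas", "brazil", "sao_paulo"]))
# → /data/hudi_trips/americas/brazil/sao_paulo

# Non-partitioned table: data files live directly under the base path.
print(partition_path("/data/hudi_trips", []))
# → /data/hudi_trips
```

The reserved *.hoodie* metadata directory always sits directly under the base path, alongside any top-level partition directories.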
+
+The Hudi storage format offers two table types with different trade-offs between ingest and query performance; data files are stored differently depending on the chosen table type.
+
+| Table Type    | Tradeoff |
+| ------------- | -------- |
+| Copy-on-Write | Optimized for read performance; ideal for slowly changing datasets |
+| Merge-on-Read | Optimized to balance write and read performance; ideal for frequently changing datasets |
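For context (not part of this spec), the table type is typically fixed when the table is created. A hedged sketch of the relevant writer configuration, assuming the commonly used `hoodie.*` property names:

```properties
# Illustrative writer configuration (property names assumed from common Hudi usage)
hoodie.table.name=hudi_trips
# Choose COPY_ON_WRITE or MERGE_ON_READ at table creation time
hoodie.datasource.write.table.type=MERGE_ON_READ
```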
+
+
+
+### Data Model
+
+Within each partition, data is organized in a key-value model. Every row is uniquely identified by a row key. To write a row into a Hudi dataset, each row must specify the following user fields:
+
+| User fields | Description |
+| ----------- | ----------- |
+| Partitioning key [Optional] | The value of this field defines the directory hierarchy within the table base path, essentially providing hierarchical isolation for managing data and related metadata |
Review Comment:
I think the order of data modeling is in this order
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]