n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268049020
########## File path: docs/concepts.md ########## @@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume only the changed files without, say, scanning all the time buckets > 07:00. -## Terminologies +## File management +Hudi organizes a dataset into a directory structure under a `basepath` on DFS. A dataset is broken up into partitions, which are folders containing data files for that partition, +very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath. - * `Hudi Dataset` - A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables. - * `Commit` - A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically - published. - * `Commit Timeline` - Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime. - * `File Slice` - Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id. - Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will - have many physical versions of it. Each of these physical version of a file represents a complete view of the - file as of a commit and is called File Slice - * `File Group` - A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id` +Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. 
Each file group contains several +`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, + along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. +Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices, and the cleaning action gets rid of +unused/older file slices to reclaim space on DFS. +Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. +This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the +mapped file group contains all versions of a group of records. -## Storage Types +## Storage Types & Views +Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e. how data is written). +This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e. how data is read). + +| Storage Type | Supported Views | +|-------------- |------------------| +| Copy On Write | Read Optimized + Incremental | +| Merge On Read | Read Optimized + Incremental + Near Real-time | -Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of -such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed -to the queries (i.e how data is read). +### Storage Types +Hudi supports the following storage types. -Hudi (will) supports the following storage types. 
+ - [Copy On Write](#copy-on-write-storage) : Stores data using solely columnar file formats (e.g. parquet). Updates simply version & rewrite the files by performing a synchronous merge during ingestion. + - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g. parquet) + row based (e.g. avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously. + +The following table summarizes the trade-offs between these two storage types: -| Storage Type | Supported Tables | -|-------------- |------------------| -| Copy On Write | Read Optimized | -| Merge On Read | Read Optimized + Near Real-time | +| Trade-off | CopyOnWrite | MergeOnRead | +|-------------- |------------------| ------------------| +| Data Latency | Higher | Lower | +| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta file) | +| Parquet File Size | Smaller (high update (I/O) cost) | Larger (low update cost) | +| Write Amplification | Higher | Lower (depending on compaction strategy) | - - Copy On Write : A heavily read optimized storage type, that simply creates new versions of files corresponding to the records that changed. - - Merge On Read : Also provides a near-real time datasets in the order of 5 mins, by shifting some of the write cost, to the reads and merging incoming and on-disk data on-the-fly -Regardless of the storage type, Hudi organizes a datasets into a directory structure under a `basepath`, -very similar to Hive tables. Dataset is broken up into partitions, which are folders containing files for that partition. -Each partition uniquely identified by its `partitionpath`, which is relative to the basepath. +### Views +Hudi supports the following views of stored data: -Within each partition, records are distributed into multiple files. Each file is identified by an unique `file id` and the `commit` that -produced the file. 
Multiple files can share the same file id but written at different commits, in case of updates. + - **Read Optimized View** : Queries on this view see the latest snapshot of the dataset as of a given commit or compaction action. + The view exposes only the base/columnar files in the latest file slices to the queries and guarantees the same columnar query performance as a non-Hudi dataset. + - **Incremental View** : Queries on this view only see new data written to the dataset since a given commit/compaction. This view effectively provides change streams to enable incremental data pipelines. + - **Realtime View** : Queries on this view see the latest snapshot of the dataset as of a given delta commit action. The view provides near real-time datasets within a few minutes, + by merging the log and base files of the latest file slice on the fly. -Each record is uniquely identified by a `record key` and mapped to a file id forever. This mapping between record key -and file id, never changes once the first version of a record has been written to a file. In short, the - `file id` identifies a group of files, that contain all versions of a group of records. +The following table summarizes the trade-offs between the different views. +| Trade-off | ReadOptimized | RealTime | +|-------------- |------------------| ------------------| +| Data Latency | Higher | Lower | +| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) | -## Copy On Write -As mentioned above, each commit on Copy On Write storage, produces new versions of files. In other words, we implicitly compact every -commit, such that only columnar data exists. As a result, the write amplification (number of bytes written for 1 byte of incoming data) - is much higher, where read amplification is close to zero. This is a much desired property for a system like Hadoop, which is predominantly read-heavy. 
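The difference between the read optimized and realtime views above can be sketched with a toy model. This is an illustrative sketch only, not Hudi code: the `FileSlice` class and function names below are hypothetical stand-ins. The read optimized view serves only the base file of the latest file slice, while the realtime view merges the log records on top of that base file on the fly.

```python
# Hypothetical, simplified model of file slices and views; not actual Hudi classes.
class FileSlice:
    def __init__(self, instant, base_records, log_records=None):
        self.instant = instant                 # commit/compaction instant time
        self.base_records = base_records       # records in the columnar base file
        self.log_records = log_records or []   # delta records in the log files

def read_optimized(slices):
    """Latest base file only: raw columnar performance, no log merge."""
    latest = max(slices, key=lambda s: s.instant)
    return dict(latest.base_records)

def realtime(slices):
    """Merge log records on top of the latest base file on the fly."""
    latest = max(slices, key=lambda s: s.instant)
    merged = dict(latest.base_records)
    for key, value in latest.log_records:
        merged[key] = value                    # a log entry wins over the base file
    return merged

# One file group: a base file at instant "001", plus a later delta log entry
slices = [FileSlice("001", [("uuid-1", "v1"), ("uuid-2", "v1")],
                    log_records=[("uuid-1", "v2")])]

print(read_optimized(slices))  # sees only base data: both records at v1
print(realtime(slices))        # sees uuid-1 at v2 after the on-the-fly merge
```

The realtime view pays the merge cost at query time, which is exactly the query-latency trade-off the table above captures.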
+## Copy On Write Storage + +File slices in Copy-On-Write storage only contain the base/columnar file and each commit produces new versions of base files. +In other words, we implicitly compact on every commit, such that only columnar data exists. As a result, the write amplification +(number of bytes written for 1 byte of incoming data) is much higher, where read amplification is zero. +This is a much desired property for analytical workloads, which are predominantly read-heavy. The following illustrates how this works conceptually, when data is written into copy-on-write storage and two queries run on top of it. {% include image.html file="hudi_cow.png" alt="hudi_cow.png" %} +As data gets written, updates to existing file groups produce a new slice for that file group stamped with the commit, and +inserts allocate a new file group and write the first slice for that file id. These file slices and their commit instant times are color coded above. Normal SQL queries running against such a dataset (e.g. `select count(*)` counting the total records in that partition) first check the timeline for the latest commit -and filters all but latest versions of each file id. As you can see, an old query does not see the current inflight commit's files colored in pink, +and filter all but the latest file slices of each file group. As you can see, an old query does not see the current inflight commit's files colored in pink, but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data. 
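The timeline check described above can be sketched as a small filter. Again, this is a hypothetical illustration, not Hudi's actual timeline API: the function name and commit states are simplified stand-ins. A query keeps only files from completed commits, then the latest such version per file group, so inflight or failed writes are simply invisible.

```python
# Illustrative-only sketch of snapshot isolation on copy-on-write storage:
# a query reads only files whose commit is completed on the timeline, and
# only the latest such version within each file group.
def committed_snapshot(files, timeline):
    """files: list of (file_id, commit) pairs; timeline: commit -> state."""
    completed = [(fid, c) for fid, c in files if timeline.get(c) == "COMPLETED"]
    latest = {}
    for fid, commit in completed:
        if fid not in latest or commit > latest[fid]:
            latest[fid] = commit               # keep newest completed version
    return sorted(latest.items())

files = [("fg-1", "001"), ("fg-1", "003"), ("fg-2", "002"), ("fg-2", "004")]
timeline = {"001": "COMPLETED", "002": "COMPLETED",
            "003": "COMPLETED", "004": "INFLIGHT"}

# fg-1 is read at its newest completed commit 003; fg-2's inflight commit 004
# is invisible, so the query falls back to 002.
print(committed_snapshot(files, timeline))  # [('fg-1', '003'), ('fg-2', '002')]
```

A query started before commit 004 completes and one started after it completes would both see a consistent, fully committed snapshot, which is the isolation property the figure illustrates.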
-The intention of copy on write storage, is to fundamentally improve how datasets are managed today on Hadoop through +The intention of copy on write storage is to fundamentally improve how datasets are managed today through - First class support for atomically updating data at the file level, instead of rewriting whole tables/partitions - Ability to incrementally consume changes, as opposed to wasteful scans or fumbling with heuristic approaches - Tight control of file sizes to keep query performance excellent (small files hurt query performance considerably). -## Merge On Read +## Merge On Read Storage Merge on read storage is a superset of copy on write, in the sense it still provides a read optimized view of the dataset via the Read Optimized table. -But, additionally stores incoming upserts for each file id, onto a `row based append log`, that enables providing near real-time data to the queries +But, additionally stores incoming upserts for each file group, onto a row based append log, that enables providing near real-time data to the queries Review comment: onto a row based `delta file` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services