[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268285521 ## File path: docs/concepts.md ## @@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume only the changed files without say scanning all the time buckets > 07:00. -## Terminologies +## File management +Hudi organizes a datasets into a directory structure under a `basepath` on DFS. Dataset is broken up into partitions, which are folders containing data files for that partition, +very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath. - * `Hudi Dataset` -A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables. - * `Commit` -A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically -published. - * `Commit Timeline` -Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime. - * `File Slice` -Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id. -Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will -have many physical versions of it. Each of these physical version of a file represents a complete view of the -file as of a commit and is called File Slice - * `File Group` -A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id` +Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. 
Each file group contains several +`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, + along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. +Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of +unused/older file slices to reclaim space on DFS. +Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. +This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the +mapped file group contains all versions of a group of records. -## Storage Types +## Storage Types & Views +Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). +This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e how data is read). + +| Storage Type | Supported Views | +|-- |--| +| Copy On Write | Read Optimized + Incremental | +| Merge On Read | Read Optimized + Incremental + Near Real-time | -Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of -such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed -to the queries (i.e how data is read). +### Storage Types +Hudi supports the following storage types. -Hudi (will) supports the following storage types. + - [Copy On Write](#copy-on-write-storage) : Stores data using solely columnar file formats (e.g parquet). 
Updates simply version & rewrite the files by performing an synchronous merge during ingestion. + - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g parquet) + row based (e.g avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously. + +Following table summarizes the trade-offs between these two storage types -| Storage Type | Supported Tables | -|-- |--| -| Copy On Write | Read Optimized | -| Merge On Read | Read Optimized + Near Real-time | +| Trade-off | CopyOnWrite | MergeOnRead | +|-- |--| --| +| Data Latency | Higher | Lower | +| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta file) | +| Parquet File Size | Smaller (high
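The file group / file slice organization described in the quoted passage can be sketched as a toy model. The file names below are made up for illustration and do not follow Hudi's actual on-disk naming scheme; the point is only the structure: a file group is keyed by file id, each slice within it is keyed by the instant that produced its base file, and the read optimized view serves the newest base file.

```python
from collections import defaultdict

# Toy model of the layout (hypothetical names, not Hudi's real naming):
# each record is (file_id, instant_time, kind, file_name).
files = [
    ("fg1", "001", "base", "fg1_001.parquet"),
    ("fg1", "001", "log",  "fg1_001.log.1"),   # updates since base at 001
    ("fg1", "003", "base", "fg1_003.parquet"), # new slice from compaction at 003
    ("fg2", "002", "base", "fg2_002.parquet"),
    ("fg2", "002", "log",  "fg2_002.log.1"),
]

# A file group maps instant -> file slice (one base file + its log files).
groups = defaultdict(lambda: defaultdict(lambda: {"base": None, "logs": []}))
for file_id, instant, kind, name in files:
    slice_ = groups[file_id][instant]
    if kind == "base":
        slice_["base"] = name
    else:
        slice_["logs"].append(name)

def latest_slice(file_id):
    """Read optimized view: only the base file of the newest slice is served."""
    instant = max(groups[file_id])
    return groups[file_id][instant]

print(latest_slice("fg1")["base"])  # fg1_003.parquet
```

This also shows why cleaning is cheap to reason about under MVCC: older slices (such as `fg1`'s slice at instant `001`) are self-contained and can be dropped wholesale once no reader needs them.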
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268063902 ## File path: docs/querying_data.md ## @@ -7,57 +7,131 @@ toc: false summary: In this page, we go over how to enable SQL queries on Hudi built tables. --- -Hudi registers the dataset into the Hive metastore backed by `HoodieInputFormat`. This makes the data accessible to -Hive & Spark & Presto automatically. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines -to call `HoodieInputFormat.getSplits()`, during query planning such that the right versions of files are exposed to it. +Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](concepts.html#views). +Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi +bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto. +Specifically, there are two Hive tables named off [table name](configurations.html#TABLE_NAME_OPT_KEY) passed during write. +For e.g, if `table name = hudi_tbl`, then we get -In the following sections, we cover the configs needed across different query engines to achieve this. + - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieInputFormat`, exposing purely columnar data. + - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieRealtimeInputFormat`, exposing merged view of base and log data. -{% include callout.html content="Instructions are currently only for Copy-on-write storage" type="info" %} +As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), +is `incremental pulls` (to obtain a change stream/log from a dataset). 
Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows +since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts), +joined with other tables (datasets/dimensions), to [write out deltas](writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above, +with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset. +In sections, below we will discuss in detail how to access all the 3 views on each query engine. ## Hive -For HiveServer2 access, [install](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr) -the hoodie-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar into the aux jars path and we should be able to recognize the Hudi tables and query them correctly. - -For beeline access, the `hive.input.format` variable needs to be set to the fully qualified path name of the inputformat `com.uber.hoodie.hadoop.HoodieInputFormat` -For Tez, additionally the `hive.tez.input.format` needs to be set to `org.apache.hadoop.hive.ql.io.HiveInputFormat` +In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hoodie-hadoop-hive-bundle-x.y.z-SNAPSHOT.jar` +in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format +classes with its dependencies are available for query planning & execution. + +### Read Optimized table {#hive-ro-view} +In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified path name of the +inputformat `com.uber.hoodie.hadoop.HoodieInputFormat`. 
For Tez, additionally the `hive.tez.input.format` needs to be set +to `org.apache.hadoop.hive.ql.io.HiveInputFormat` + +### Real time table {#hive-rt-view} +In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that +queries can pick up the custom RecordReader as well. + +### Incremental Pulling {#hive-incr-pull} + +`HiveIncrementalPuller` allows the incrementally extracting changes from large fact/dimension tables via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and +incremental primitives (speed up query by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the Hive query saving its results in a temp table. +that can later be upserted. Upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table. +e.g:
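The Hive session setup described above can be sketched in one beeline invocation. The JDBC URL and table name here are placeholders; the two property values are the ones stated in the text, and `hive.tez.input.format` is only needed when the execution engine is Tez.

```shell
# Query the read optimized view from beeline.
# hiveserver:10000 and hudi_tbl are placeholder names for illustration.
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=com.uber.hoodie.hadoop.HoodieInputFormat \
  --hiveconf hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  -e "select count(*) from hudi_tbl"
```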

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268063284 ## File path: docs/querying_data.md ## @@ -7,57 +7,131 @@ toc: false summary: In this page, we go over how to enable SQL queries on Hudi built tables. --- -Hudi registers the dataset into the Hive metastore backed by `HoodieInputFormat`. This makes the data accessible to -Hive & Spark & Presto automatically. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines -to call `HoodieInputFormat.getSplits()`, during query planning such that the right versions of files are exposed to it. +Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](concepts.html#views). +Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi +bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto. +Specifically, there are two Hive tables named off [table name](configurations.html#TABLE_NAME_OPT_KEY) passed during write. +For e.g, if `table name = hudi_tbl`, then we get -In the following sections, we cover the configs needed across different query engines to achieve this. + - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieInputFormat`, exposing purely columnar data. + - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieRealtimeInputFormat`, exposing merged view of base and log data. -{% include callout.html content="Instructions are currently only for Copy-on-write storage" type="info" %} +As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), +is `incremental pulls` (to obtain a change stream/log from a dataset). 
Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows +since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts), +joined with other tables (datasets/dimensions), to [write out deltas](writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above, +with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset. +In sections, below we will discuss in detail how to access all the 3 views on each query engine. Review comment: s/,// This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
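The incremental pull contract quoted above ("ALL and ONLY the updated & new rows since a specified instant time") can be sketched as a toy timeline filter. This is illustrative only, not Hudi's actual API: a real pull consults the commit timeline and reads only the affected files, but the selection rule is the same.

```python
# Toy commit timeline: instant time -> rows committed at that instant.
commits = {
    "20190301": [{"key": "a", "val": 1}, {"key": "b", "val": 1}],
    "20190302": [{"key": "a", "val": 2}],   # an update to 'a'
    "20190303": [{"key": "c", "val": 1}],   # a brand new row
}

def incremental_pull(commits, begin_instant):
    """All and only rows from commits strictly after begin_instant."""
    return [row
            for instant in sorted(commits)
            if instant > begin_instant
            for row in commits[instant]]

# Pulling since 20190301 returns just the rows from 20190302 and 20190303.
print(incremental_pull(commits, "20190301"))
```

A downstream pipeline would record the last instant it consumed and pass it as `begin_instant` on the next run, which is exactly the state `HoodieDeltaStreamer` is described as deriving from the directory structure.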
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268062266 ## File path: docs/concepts.md ## @@ -133,23 +155,7 @@ There are lot of interesting things happening in this example, which bring out t strategy, where we aggressively compact the latest partitions compared to older partitions, we could ensure the RO Table sees data published within X minutes in a consistent fashion. -The intention of merge on read storage, is to enable near real-time processing directly on top of Hadoop, as opposed to copying +The intention of merge on read storage, is to enable near real-time processing directly on top of DFS, as opposed to copying Review comment: The primary intention of merge on read storage is to enable near real-time processing directly on top of DFS. There are a few secondary side benefits to this storage, such as reduced write amplification by avoiding a synchronous merge of data, i.e. the amount of data written per byte of data in a batch.
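The write amplification benefit mentioned in the review comment above can be made concrete with back-of-envelope arithmetic. The sizes below are made-up numbers for illustration, and this ignores the (deferred, possibly asynchronous) cost of compaction in the merge on read case.

```python
# Write amplification = bytes written per byte of incoming data.
batch_bytes   = 100 * 2**20   # 100 MB batch of incoming updates (assumed)
parquet_bytes = 10 * 2**30    # 10 GB of parquet files those updates touch (assumed)

# Copy on write: every touched parquet file is rewritten in full
# during the synchronous merge at ingest time.
cow_amplification = parquet_bytes / batch_bytes

# Merge on read: updates are appended to delta (log) files at ingest time;
# the expensive columnar rewrite is deferred to compaction.
mor_amplification = batch_bytes / batch_bytes

print(cow_amplification)  # 102.4
print(mor_amplification)  # 1.0
```

The gap is what makes merge on read attractive for update-heavy workloads with small batches landing on large files, at the cost of higher read-time merge work on the real-time view.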
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268049189 ## File path: docs/concepts.md ## @@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume only the changed files without say scanning all the time buckets > 07:00. -## Terminologies +## File management +Hudi organizes a datasets into a directory structure under a `basepath` on DFS. Dataset is broken up into partitions, which are folders containing data files for that partition, +very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath. - * `Hudi Dataset` -A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables. - * `Commit` -A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically -published. - * `Commit Timeline` -Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime. - * `File Slice` -Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id. -Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will -have many physical versions of it. Each of these physical version of a file represents a complete view of the -file as of a commit and is called File Slice - * `File Group` -A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id` +Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. 
Each file group contains several +`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, + along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. +Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of +unused/older file slices to reclaim space on DFS. +Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. +This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the +mapped file group contains all versions of a group of records. -## Storage Types +## Storage Types & Views +Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). +This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e how data is read). + +| Storage Type | Supported Views | +|-- |--| +| Copy On Write | Read Optimized + Incremental | +| Merge On Read | Read Optimized + Incremental + Near Real-time | -Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of -such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed -to the queries (i.e how data is read). +### Storage Types +Hudi supports the following storage types. -Hudi (will) supports the following storage types. + - [Copy On Write](#copy-on-write-storage) : Stores data using solely columnar file formats (e.g parquet). 
Updates simply version & rewrite the files by performing an synchronous merge during ingestion. + - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g parquet) + row based (e.g avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously. + +Following table summarizes the trade-offs between these two storage types -| Storage Type | Supported Tables | -|-- |--| -| Copy On Write | Read Optimized | -| Merge On Read | Read Optimized + Near Real-time | +| Trade-off | CopyOnWrite | MergeOnRead | +|-- |--| --| +| Data Latency | Higher | Lower | +| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta file) | +| Parquet File Size | Smaller (high
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048966

## File path: docs/concepts.md ##

@@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at
 With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume only the changed files without, say, scanning all the time buckets > 07:00.
-## Terminologies
+## File management
+Hudi organizes a dataset into a directory structure under a `basepath` on DFS. The dataset is broken up into partitions, which are folders containing data files for that partition,
+very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
 
- * `Hudi Dataset`
-A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
- * `Commit`
-A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically
-published.
- * `Commit Timeline`
-Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime.
- * `File Slice`
-Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id.
-Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will
-have many physical versions of it. Each of these physical versions of a file represents a complete view of the
-file as of a commit and is called a File Slice.
- * `File Group`
-A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id`
+Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
+`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
+along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced.
+Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices, and the cleaning action gets rid of
+unused/older file slices to reclaim space on DFS.
+Hudi provides efficient upserts by consistently mapping a given hoodie key (record key + partition path) to a file group, via an indexing mechanism.
+This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the
+mapped file group contains all versions of a group of records.
 
-## Storage Types
+## Storage Types & Views
+Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e. how data is written).
+This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e. how data is read).
+
+| Storage Type  | Supported Views                               |
+|--             |--                                             |
+| Copy On Write | Read Optimized + Incremental                  |
+| Merge On Read | Read Optimized + Incremental + Near Real-time |
 
-Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
-such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed
-to the queries (i.e how data is read).
+### Storage Types
+Hudi supports the following storage types.
 
-Hudi (will) supports the following storage types.
+ - [Copy On Write](#copy-on-write-storage) : Stores data using solely columnar file formats (e.g. parquet). Updates simply version & rewrite the files by performing a synchronous merge during ingestion.
+ - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g. parquet) + row-based (e.g. avro) file formats. Updates are logged to delta files & later compacted to produce new versions of the columnar files, synchronously or asynchronously.
+
+The following table summarizes the trade-offs between these two storage types.
 
-| Storage Type  | Supported Tables                |
-|--             |--                               |
-| Copy On Write | Read Optimized                  |
-| Merge On Read | Read Optimized + Near Real-time |
 
+| Trade-off         | CopyOnWrite                     | MergeOnRead                  |
+|--                 |--                               |--                            |
+| Data Latency      | Higher                          | Lower                        |
+| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta file) |
+| Parquet File Size | Smaller (high
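The key invariant described in the quoted hunk is that a hoodie key (record key + partition path) is mapped to a file group once, and that mapping never changes afterwards, so every later version of a record lands in the same file group. A minimal sketch of that idea (this is an illustration only, not Hudi's actual indexing code; the class and hashing scheme are hypothetical):

```python
# Illustrative sketch of Hudi's key -> file group invariant (hypothetical
# code, not taken from the Hudi codebase): the first write of a record
# assigns it a file group, and every subsequent upsert of the same hoodie
# key resolves to that same file group.
import hashlib


class SimpleIndex:
    def __init__(self, num_file_groups):
        self.num_file_groups = num_file_groups
        self.mapping = {}  # hoodie key -> file group id, fixed after first write

    def file_group_for(self, record_key, partition_path):
        hoodie_key = f"{partition_path}/{record_key}"
        if hoodie_key not in self.mapping:
            # Assign a file group only on the first write of this key.
            digest = hashlib.md5(hoodie_key.encode()).hexdigest()
            self.mapping[hoodie_key] = int(digest, 16) % self.num_file_groups
        return self.mapping[hoodie_key]


index = SimpleIndex(num_file_groups=4)
first = index.file_group_for("uuid-123", "2019/03/21")
# A later upsert of the same record resolves to the same file group,
# so the file group accumulates all versions of this record.
assert index.file_group_for("uuid-123", "2019/03/21") == first
```

In real Hudi the mapping is maintained by the pluggable index (e.g. a bloom-filter index over the data files), but the consistency property the docs describe is the same.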
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048778 ## File path: docs/concepts.md
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048823 ## File path: docs/concepts.md
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048668 ## File path: docs/concepts.md
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048597 ## File path: docs/concepts.md
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048554 ## File path: docs/concepts.md
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268048477 ## File path: docs/concepts.md
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268047857 ## File path: docs/concepts.md ## @@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume only the changed files without say scanning all the time buckets > 07:00. -## Terminologies +## File management +Hudi organizes a dataset into a directory structure under a `basepath` on DFS. A dataset is broken up into partitions, which are folders containing data files for that partition, +very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath. - * `Hudi Dataset` -A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables. - * `Commit` -A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically -published. - * `Commit Timeline` -Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime. - * `File Slice` -Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id. -Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will -have many physical versions of it. Each of these physical version of a file represents a complete view of the -file as of a commit and is called File Slice - * `File Group` -A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id` +Within each partition, files are organized into `file groups`, uniquely identified by a `file id`.
Each file group contains several +`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, + along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. +Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices and the cleaning action gets rid of +unused/older file slices to reclaim space on DFS. +Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. +This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the +mapped file group contains all versions of a group of records. -## Storage Types +## Storage Types & Views +Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e. how data is written). +This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e. how data is read). + +| Storage Type | Supported Views | +|-- |--| +| Copy On Write | Read Optimized + Incremental | +| Merge On Read | Read Optimized + Incremental + Near Real-time | -Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of -such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed -to the queries (i.e how data is read). +### Storage Types +Hudi supports the following storage types. -Hudi (will) supports the following storage types. + - [Copy On Write](#copy-on-write-storage) : Stores data using solely columnar file formats (e.g. parquet).
Updates simply version & rewrite the files by performing a synchronous merge during ingestion. + - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g. parquet) + row based (e.g. avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously. + +The following table summarizes the trade-offs between these two storage types -| Storage Type | Supported Tables | -|-- |--| -| Copy On Write | Read Optimized | -| Merge On Read | Read Optimized + Near Real-time | +| Trade-off | CopyOnWrite | MergeOnRead | +|-- |--| --| +| Data Latency | Higher | Lower | +| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta file) | +| Parquet File Size | Smaller (high
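The file layout described in the diff above (file groups identified by a `file id`, each holding a sequence of file slices produced at commit/compaction instants) can be sketched as follows. This is only an illustrative sketch: the `<fileId>_<instantTime>.parquet` naming is a simplifying assumption rather than Hudi's exact on-disk convention, and log files are omitted.

```python
from collections import defaultdict

def group_file_slices(partition_files):
    """Group a partition's base files into file groups keyed by file id,
    with slices ordered by the instant time that produced them.

    Assumes the hypothetical naming "<fileId>_<instantTime>.parquet"."""
    groups = defaultdict(list)
    for name in partition_files:
        stem = name[: -len(".parquet")]
        file_id, instant = stem.rsplit("_", 1)
        groups[file_id].append(instant)
    # Each value is the file group's slice timeline, in commit order.
    return {fid: sorted(instants) for fid, instants in groups.items()}

files = [
    "fg1_20190301.parquet",  # first slice of file group fg1
    "fg1_20190302.parquet",  # new slice written by a later commit/compaction
    "fg2_20190301.parquet",
]
groups = group_file_slices(files)
assert groups["fg1"] == ["20190301", "20190302"]
assert groups["fg2"] == ["20190301"]
```

Under MVCC, readers pick a complete slice as of some instant, while cleaning would simply drop the older entries of each list.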
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268046898 ## File path: docs/concepts.md ## @@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume only the changed files without say scanning all the time buckets > 07:00. -## Terminologies +## File management +Hudi organizes a dataset into a directory structure under a `basepath` on DFS. A dataset is broken up into partitions, which are folders containing data files for that partition, +very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath. - * `Hudi Dataset` -A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables. - * `Commit` -A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically -published. - * `Commit Timeline` -Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime. - * `File Slice` -Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id. -Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will -have many physical versions of it. Each of these physical version of a file represents a complete view of the -file as of a commit and is called File Slice - * `File Group` -A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id` +Within each partition, files are organized into `file groups`, uniquely identified by a `file id`.
Each file group contains several +`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, + along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. +Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices and the cleaning action gets rid of +unused/older file slices to reclaim space on DFS. +Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. +This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the +mapped file group contains all versions of a group of records. -## Storage Types +## Storage Types & Views +Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e. how data is written). +This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e. how data is read). Review comment: `Views` are merely how the underlying data is exposed to the queries (i.e. how data is read). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
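The storage-type vs. view distinction discussed in this comment (storage types define how data is written; views define how it is read) can be captured in a small lookup mirroring the table in the diff. The identifiers below are hypothetical names chosen for the sketch, not Hudi API names.

```python
# Which views each storage type supports, per the table in the diff above.
SUPPORTED_VIEWS = {
    "copy_on_write": {"read_optimized", "incremental"},
    "merge_on_read": {"read_optimized", "incremental", "near_real_time"},
}

def supports(storage_type, view):
    """Return whether the given view (how data is read) is available
    on a dataset written with the given storage type."""
    return view in SUPPORTED_VIEWS[storage_type]

assert supports("merge_on_read", "near_real_time")
assert not supports("copy_on_write", "near_real_time")
assert supports("copy_on_write", "incremental")
```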
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268040070 ## File path: docs/concepts.md ## @@ -24,8 +29,14 @@ Such key activities include MergeOnRead storage type of dataset * `COMPACTIONS` - Background activity to reconcile differential data structures within Hudi e.g: moving updates from row based log files to columnar formats. +Any given instant can be +in one of the following states + + * `REQUESTED` - Denotes an action has been scheduled, but has not begun yet Review comment: nit : s/but has not begun yet/but has not been initiated
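The instant states discussed in this comment form a simple forward-only progression. The sketch below assumes the three states Hudi's timeline uses (`REQUESTED`, `INFLIGHT`, `COMPLETED`); only `REQUESTED` appears in the diff excerpt, so treat the other two as an assumption of this sketch.

```python
from enum import Enum

class State(Enum):
    REQUESTED = "requested"  # action scheduled, but not yet initiated
    INFLIGHT = "inflight"    # action currently being performed
    COMPLETED = "completed"  # action finished on the timeline

# Legal forward transitions; an instant never moves backwards.
TRANSITIONS = {State.REQUESTED: State.INFLIGHT, State.INFLIGHT: State.COMPLETED}

def advance(state):
    """Move an instant to its next state; COMPLETED is terminal."""
    if state not in TRANSITIONS:
        raise ValueError(f"{state} is a terminal state")
    return TRANSITIONS[state]

assert advance(State.REQUESTED) is State.INFLIGHT
assert advance(State.INFLIGHT) is State.COMPLETED
```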
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content
n3nash commented on a change in pull request #610: Major cleanup of docs structure/content URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268039634 ## File path: docs/concepts.md ## @@ -7,15 +7,20 @@ toc: false summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi" --- -Apache Hudi (pronounced “Hudi”) provides the following primitives over datasets on DFS +Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over datasets on DFS * Upsert (how do I change the dataset?) - * Incremental consumption(how do I fetch data that changed?) + * Incremental pull (how do I fetch data that changed?) +In this section, we will discuss key concepts & terminologies that are important to understand, to be able to effectively use these primitives. -In order to achieve this, Hudi maintains a `timeline` of all activity performed on the dataset, that helps provide `instantaenous` views of the dataset, -while also efficiently supporting retrieval of data in the order of arrival into the dataset. -Such key activities include +## Timeline +At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` of time that helps provide instantaenous views of the dataset, Review comment: 'At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` of time. This helps provide instantaneous views of the dataset while also supporting efficient retrieval of changed data in the order of arrival`. A Hudi instant is uniquely defined by 3 components namely, the action type, the instant time at which it started and its current state... Note the spelling of instantaneous as well.
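Per the review comment above, an instant is uniquely defined by three components: the action type, the instant time at which it started, and its current state. Below is a minimal sketch of that model, together with a hypothetical `incremental_pull` helper illustrating how completed commits after a given instant time would be consumed in arrival order (names and signatures are illustrative, not Hudi's API).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instant:
    """A timeline instant: action type, instant time, and current state."""
    action: str  # e.g. "commit", "compaction"
    time: str    # monotonically increasing timestamp string
    state: str   # "requested", "inflight", or "completed"

def incremental_pull(timeline, since):
    """Return commits published (completed) after `since`, in instant-time
    order, mimicking incremental pull over the timeline."""
    return sorted(
        (i for i in timeline if i.action == "commit"
         and i.state == "completed" and i.time > since),
        key=lambda i: i.time,
    )

timeline = [
    Instant("commit", "1000", "completed"),
    Instant("commit", "1100", "completed"),
    Instant("commit", "1200", "inflight"),  # not yet published; excluded
]
assert [i.time for i in incremental_pull(timeline, "1000")] == ["1100"]
```

Because commits are published atomically with monotonically increasing times, filtering on completed state and instant time is enough to fetch only changed data.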