[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784217
 
 

 ##
 File path: docs/_docs/2_2_writing_data.md
 ##
 @@ -156,41 +157,31 @@ inputDF.write()
 
 ## Syncing to Hive
 
-Both tools above support syncing of the dataset's latest schema to Hive 
metastore, such that queries can pick up new columns and partitions.
+Both tools above support syncing of the table's latest schema to Hive 
metastore, such that queries can pick up new columns and partitions.
In case it's preferable to run this from the command line or in an independent JVM, 
Hudi provides a `HiveSyncTool`, which can be invoked as below, 
-once you have built the hudi-hive module.
+once you have built the hudi-hive module. The following shows how to sync the 
table written above via the Datasource Writer to the Hive metastore.
+
+```java
+cd hudi-hive
+./run_sync_tool.sh  --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basepath> --database default --table <tablename>
+```
+
+Starting with Hudi 0.5.1, the read optimized version of merge-on-read tables is 
suffixed '_ro' by default. For backwards compatibility with older Hudi versions, 
+an optional HiveSyncConfig, `--skip-ro-suffix`, has been provided to turn off 
the '_ro' suffixing if desired. Explore other hive sync options using the 
following command:
 
 ```java
 cd hudi-hive
 ./run_sync_tool.sh
  [hudi-hive]$ ./run_sync_tool.sh --help
-Usage:  [options]
-  Options:
-  * --base-path
-   Basepath of Hudi dataset to sync
-  * --database
-   name of the target database in Hive
---help, -h
-   Default: false
-  * --jdbc-url
-   Hive jdbc connect url
-  * --use-jdbc
-   Whether to use jdbc connection or hive metastore (via thrift)
-  * --pass
-   Hive password
-  * --table
-   name of the target table in Hive
-  * --user
-   Hive username
 ```
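
For illustration, Hive sync can also be enabled directly on the Spark datasource write instead of running the standalone tool. A minimal Scala sketch, assuming the `hoodie.datasource.hive_sync.*` option keys from `DataSourceWriteOptions` and a hypothetical base path (verify the keys against your Hudi version):

```scala
// Sketch only: `inputDF` and `spark` come from the datasource writer example above.
// The hive_sync option keys are assumptions -- check DataSourceWriteOptions for your Hudi version.
inputDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", "hudi_trips")
  .option("hoodie.datasource.hive_sync.partition_fields", "partition")
  .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://hiveserver:10000")
  .mode("append")
  .save("/path/to/hudi_trips") // hypothetical base path
```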
 
 ## Deletes 
 
-Hudi supports implementing two types of deletes on data stored in Hudi 
datasets, by enabling the user to specify a different record payload 
implementation. 
+Hudi supports implementing two types of deletes on data stored in Hudi tables, 
by enabling the user to specify a different record payload implementation. 
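
For example, a hard delete can be expressed as an upsert of just the record keys with an "empty" payload. A hedged Scala sketch; `deleteKeysDF` is hypothetical and the payload class/package should be confirmed for your Hudi version:

```scala
// Sketch only: deleteKeysDF holds the record key + partition path of the rows to delete.
deleteKeysDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  // Assumed class name -- verify the exact payload class shipped with your Hudi version.
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
  .mode("append")
  .save("/path/to/hudi_trips") // hypothetical base path
```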
 
 Review comment:
   lets link to the delete blog from here? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783925
 
 

 ##
 File path: docs/_docs/2_1_concepts.md
 ##
 @@ -1,37 +1,37 @@
 ---
 title: "Concepts"
-keywords: hudi, design, storage, views, timeline
+keywords: hudi, design, table, queries, timeline
 permalink: /docs/concepts.html
 summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over Hadoop-compatible storage
 
- * Upsert (how do I change the dataset?)
- * Incremental pull   (how do I fetch data that changed?)
+ * Update/Delete Records  (how do I change records in a table?)
+ * Change Streams (how do I fetch data that changed?)
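
For illustration, changing records boils down to an upsert through the datasource writer; a minimal Scala sketch (table name, key/partition fields and path are hypothetical):

```scala
// Sketch only: upsert a DataFrame of changed records into an existing Hudi table.
updatesDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/path/to/hudi_trips")
```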
 
 Review comment:
   how do I fetch `records` that changed ? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783760
 
 

 ##
 File path: docs/_docs/1_3_use_cases.md
 ##
 @@ -20,7 +20,7 @@ or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
 It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org), Hudi 
helps __enforce a minimum file size on HDFS__, which improves NameNode health 
by solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since they are 
typically higher volume (e.g. click streams) and, if not managed well, can cause 
serious damage to your Hadoop cluster.
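
The file sizing behaviour is governed by write configs; a hedged Scala sketch, assuming the `hoodie.parquet.*` keys (names and defaults may differ across Hudi versions):

```scala
// Sketch only: steer the writer to bin-pack inserts into files below the small-file limit
// and cap base files around the configured max size. Values here are illustrative.
inputDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "hudi_events")
  .option("hoodie.parquet.small.file.limit", "104857600") // ~100MB
  .option("hoodie.parquet.max.file.size", "125829120")    // ~120MB
  .mode("append")
  .save("/path/to/hudi_events")
```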
 
 Review comment:
   ah. good catch




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784404
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built 
tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 
logical views on top, as explained [before](/docs/concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines 
like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
+Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+Specifically, the following Hive tables are registered based off the [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during 
write.   
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
+ - `hudi_trips` supports snapshot querying and incremental querying of the 
table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying 
(providing near-real time data) of the table  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
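
For illustration, with the tables above registered, queries against each can be issued as usual; a small sketch via Spark SQL (table names follow the example above):

```scala
// Sketch only: hudi_trips_ro serves read optimized queries (base files only),
// hudi_trips_rt serves snapshot queries (base + log files merged on the fly).
spark.sql("SELECT count(*) FROM hudi_trips_ro").show()
spark.sql("SELECT count(*) FROM hudi_trips_rt").show()
```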
+ 
 
 As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi 
datasets can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi 
tables can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
 since a specified instant time. This, together with upserts, are particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is 
realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the dataset. 
+joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is 
realized by querying one of the tables above, 
+with special configurations that indicate to query planning that only 
incremental data needs to be fetched out of the table. 
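
A hedged Scala sketch of such an incremental pull through the Spark datasource; the option keys reflect the 0.5.1 naming and the begin instant is hypothetical, so verify both for your version:

```scala
// Sketch only: fetch ALL and ONLY the rows changed after the given commit instant.
val incrementalDF = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200114000000") // exclusive begin instant
  .load("/path/to/hudi_trips")
incrementalDF.createOrReplaceTempView("hudi_trips_incremental")
```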
 
-In sections, below we will discuss in detail how to access all the 3 views on 
each query engine.
+In the sections below, we will discuss how to access these query types from 
different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the 
HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
+In order for Hive to recognize Hudi tables and query correctly, the 
HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
 in its [aux jars 
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
 This will ensure the input format 
 classes with its dependencies are available for query planning & execution. 
 
-### Read Optimized table
+### Read optimized querying
 In addition to setup above, for beeline cli access, the 

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783683
 
 

 ##
 File path: docs/_docs/1_2_structure.md
 ##
 @@ -6,16 +6,16 @@ summary: "Hudi brings stream processing to big data, 
providing fresh data while
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical 
datasets over DFS 
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
 or cloud stores) and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical 
tables over DFS 
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
 or cloud stores) and provides three types of querying.
 
- * **Read Optimized View** - Provides excellent query performance on pure 
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
- * **Incremental View** - Provides a change stream out of the dataset to feed 
downstream jobs/ETLs.
- * **Near-Real time Table** - Provides queries on real-time data, using a 
combination of columnar & row based storage (e.g Parquet + 
[Avro](http://avro.apache.org/docs/current/mr.html))
+ * **Read Optimized querying** - Provides excellent query performance on pure 
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
 
 Review comment:
   just `Query` and not `querying`? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784018
 
 

 ##
 File path: docs/_docs/2_1_concepts.md
 ##
 @@ -53,69 +53,70 @@ With the help of the timeline, an incremental query 
attempting to get all new da
 only the changed files without say scanning all the time buckets > 07:00.
 
 ## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on 
DFS. Dataset is broken up into partitions, which are folders containing data 
files for that partition,
+Hudi organizes a table into a directory structure under a `basepath` on DFS. 
The table is broken up into partitions, which are folders containing data files 
for that partition,
 very similar to Hive tables. Each partition is uniquely identified by its 
`partitionpath`, which is relative to the basepath.
 
 Within each partition, files are organized into `file groups`, uniquely 
identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`) 
produced at a certain commit/compaction instant time,
+`file slices`, where each slice contains a base file (`*.parquet`) produced at 
a certain commit/compaction instant time,
  along with set of log files (`*.log.*`) that contain inserts/updates to the 
base file since the base file was produced. 
 Hudi adopts a MVCC design, where compaction action merges logs and base files 
to produce new file slices and cleaning action gets rid of 
 unused/older file slices to reclaim space on DFS. 
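
To make these terms concrete, a hypothetical layout for one partition under a table's basepath might look as follows (file names are illustrative only, not the exact naming convention):

```
/data/hudi_trips/                       <- basepath
  .hoodie/                              <- timeline & table metadata
  2020/01/20/                           <- partitionpath
    fileId1_20200120103000.parquet      <- base file of a file slice (file group fileId1)
    .fileId1_20200120103000.log.1       <- log file with inserts/updates against that base file
    fileId2_20200120104500.parquet      <- base file of another file group
```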
 
-Hudi provides efficient upserts, by mapping a given hoodie key (record key + 
partition path) consistently to a file group, via an indexing mechanism. 
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key + 
partition path) consistently to a file id, via an indexing mechanism. 
 This mapping between record key and file group/file id, never changes once the 
first version of a record has been written to a file. In short, the 
 mapped file group contains all versions of a group of records.
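
The indexing mechanism is configurable at write time; a hedged sketch, assuming the `hoodie.index.type` config key (supported values vary by Hudi version):

```scala
// Sketch only: pick the index used to map hoodie keys to file ids.
inputDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.index.type", "BLOOM") // assumed value; e.g. bloom-filter based indexing
  .mode("append")
  .save("/path/to/hudi_trips")
```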
 
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how 
the above primitives and timeline activities are implemented on top of such 
organization (i.e how data is written). 
-In turn, `views` define how the underlying data is exposed to the queries (i.e 
how data is read). 
+## Table Types & Querying
 
 Review comment:
   and Queries (instead of Querying)? 

