[incubator-hudi] branch asf-site updated: [HUDI-577] update docker demo page and quick start pages (#1279)

bhavanisudha Thu, 30 Jan 2020 21:47:17 -0800

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new ace36af  [HUDI-577] update docker demo page and quick start pages (#1279)
ace36af is described below

commit ace36afff30e940f764e2ad8ddbfd194086a3bde
Author: Bhavani Sudha Saktheeswaran <bhasu...@uber.com>
AuthorDate: Thu Jan 30 21:46:25 2020 -0800

    [HUDI-577] update docker demo page and quick start pages (#1279)
    
    Summary:
    - contains changes that reflect renaming of terminologies to be in sync with CWiki
    - contains doc changes pertaining to support of multiple scala versions
---
 docs/_docs/0_4_docker_demo.md       | 236 +++++++++++++++++++-----------------
 docs/_docs/1_1_quick_start_guide.md |  23 +++-
 docs/_docs/1_2_structure.md         |   2 +-
 docs/_docs/2_1_concepts.md          |   6 +-
 docs/_docs/2_2_writing_data.md      |  84 +++++++++----
 docs/_docs/2_3_querying_data.md     |  20 +--
 docs/_docs/2_5_performance.md       |   4 +-
 7 files changed, 217 insertions(+), 158 deletions(-)

diff --git a/docs/_docs/0_4_docker_demo.md b/docs/_docs/0_4_docker_demo.md
index 87c0716..3033371 100644
--- a/docs/_docs/0_4_docker_demo.md
+++ b/docs/_docs/0_4_docker_demo.md
@@ -40,7 +40,7 @@ Also, this has not been tested on some environments like Docker on Windows.
 
 ### Build Hudi
 
-The first step is to build hudi
+The first step is to build hudi. **Note** This step builds hudi on default supported scala version - 2.11.
 ```java
 cd <HUDI_WORKSPACE>
 mvn package -DskipTests
@@ -63,7 +63,10 @@ Stopping hivemetastore             ... done
 Stopping historyserver             ... done
 .......
 ......
-Creating network "hudi_demo" with the default driver
+Creating network "compose_default" with the default driver
+Creating volume "compose_namenode" with default driver
+Creating volume "compose_historyserver" with default driver
+Creating volume "compose_hive-metastore-postgresql" with default driver
 Creating hive-metastore-postgresql ... done
 Creating namenode                  ... done
 Creating zookeeper                 ... done
@@ -94,12 +97,12 @@ At this point, the docker cluster will be up and running. The demo cluster bring
 
 ## Demo
 
-Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.
+Stock Tracker data will be used to showcase different Hudi query types and the effects of Compaction.
 
 Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
 The first batch contains stocker tracker data for some stock symbols during the first hour of trading window
 (9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30 mins (10:30 - 11 a.m). Hudi will
-be used to ingest these batches to a dataset which will contain the latest stock tracker data at hour level granularity.
+be used to ingest these batches to a table which will contain the latest stock tracker data at hour level granularity.
 The batches are windowed intentionally so that the second batch contains updates to some of the rows in the first batch.
 
 ### Step 1 : Publish the first batch to Kafka
@@ -151,19 +154,19 @@ kafkacat -b kafkabroker -L -J | jq .
 ### Step 2: Incrementally ingest data from Kafka topic
 
 Hudi comes with a tool named DeltaStreamer. This tool can connect to variety of data sources (including Kafka) to
-pull changes and apply to Hudi dataset using upsert/insert primitives. Here, we will use the tool to download
+pull changes and apply to Hudi table using upsert/insert primitives. Here, we will use the tool to download
 json data from kafka topic and ingest to both COW and MOR tables we initialized in the previous step. This tool
-automatically initializes the datasets in the file-system if they do not exist yet.
+automatically initializes the tables in the file-system if they do not exist yet.
 
 ```java
 docker exec -it adhoc-2 /bin/bash
 
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow table in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --table-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
 
 
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor table in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --table-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
 
 
 # As part of the setup (Look at setup_demo.sh), the configs needed for DeltaStreamer is uploaded to HDFS. The configs
@@ -172,50 +175,50 @@ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
 exit
 ```
 
-You can use HDFS web-browser to look at the datasets
+You can use HDFS web-browser to look at the tables
 `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
 
-You can explore the new partition folder created in the dataset along with a "deltacommit"
+You can explore the new partition folder created in the table along with a "deltacommit"
 file under .hoodie which signals a successful commit.
 
-There will be a similar setup when you browse the MOR dataset
+There will be a similar setup when you browse the MOR table
 `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor`
 
 
 ### Step 3: Sync with Hive
 
-At this step, the datasets are available in HDFS. We need to sync with Hive to create new Hive tables and add partitions
-inorder to run Hive queries against those datasets.
+At this step, the tables are available in HDFS. We need to sync with Hive to create new Hive tables and add partitions
+inorder to run Hive queries against those tables.
 
 ```java
 docker exec -it adhoc-2 /bin/bash
 
-# THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS and sync the HDFS state to Hive
+# THis command takes in HIveServer URL and COW Hudi table location in HDFS and sync the HDFS state to Hive
 /var/hoodie/ws/hudi-hive/run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
 .....
-2018-09-24 22:22:45,568 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
+2020-01-25 19:51:28,953 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_cow
 .....
 
-# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR storage)
+# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR table type)
 /var/hoodie/ws/hudi-hive/run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
 ...
-2018-09-24 22:23:09,171 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
+2020-01-25 19:51:51,066 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_ro
 ...
-2018-09-24 22:23:09,559 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor_rt
+2020-01-25 19:51:51,569 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_rt
 ....
 exit
 ```
 After executing the above command, you will notice
 
-1. A hive table named `stock_ticks_cow` created which provides Read-Optimized view for the Copy On Write dataset.
-2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the Merge On Read dataset. The former
-provides the ReadOptimized view for the Hudi dataset and the later provides the realtime-view for the dataset.
+1. A hive table named `stock_ticks_cow` created which supports Snapshot and Incremental queries on Copy On Write table.
+2. Two new tables `stock_ticks_mor_rt` and `stock_ticks_mor_ro` created for the Merge On Read table. The former
+supports Snapshot and Incremental queries (providing near-real time data) while the later supports ReadOptimized queries.
 
 
 ### Step 4 (a): Run Hive Queries
 
-Run a hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both read-optimized
-(for both COW and MOR dataset)and realtime views (for MOR dataset)give the same value "10:29 a.m" as Hudi create a
+Run a hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both snapshot 
+(for both COW and MOR _rt table) and read-optimized queries (for MOR _ro table) give the same value "10:29 a.m" as Hudi create a
 parquet file for the first batch of data.
 
 ```java
@@ -227,10 +230,10 @@ beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache
 |      tab_name       |
 +---------------------+--+
 | stock_ticks_cow     |
-| stock_ticks_mor     |
+| stock_ticks_mor_ro  |
 | stock_ticks_mor_rt  |
 +---------------------+--+
-2 rows selected (0.801 seconds)
+3 rows selected (1.199 seconds)
 0: jdbc:hive2://hiveserver:10000>
 
 
@@ -269,11 +272,11 @@ Now, run a projection query:
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. Lets look at both
-ReadOptimized and Realtime views supported by M-O-R dataset
+Lets run similar queries against M-O-R table. Lets look at both 
+ReadOptimized and Snapshot(realtime data) queries supported by M-O-R table
 
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 +---------+----------------------+--+
 | symbol  |         _c1          |
@@ -283,7 +286,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the futu
 1 row selected (6.326 seconds)
 
 
-# Run against Realtime View. Notice that the latest timestamp is again 10:29
+# Run Snapshot Query. Notice that the latest timestamp is again 10:29
 
 0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
@@ -295,9 +298,9 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the futu
 1 row selected (1.606 seconds)
 
 
-# Run projection query against Read Optimized and Realtime tables
+# Run Read Optimized and Snapshot project queries
 
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 +----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
 +----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -323,17 +326,17 @@ running in spark-sql
 
 ```java
 docker exec -it adhoc-1 /bin/bash
-$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --executor-memory 3G --num-executors 1  --packages org.apache.spark:spark-avro_2.11:2.4.4
 ...
 
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
+   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
       /_/
 
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
+Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
 Type in expressions to have them evaluated.
 Type :help for more information.
 
@@ -343,7 +346,7 @@ scala> spark.sql("show tables").show(100, false)
 |database|tableName         |isTemporary|
 +--------+------------------+-----------+
 |default |stock_ticks_cow   |false      |
-|default |stock_ticks_mor   |false      |
+|default |stock_ticks_mor_ro|false      |
 |default |stock_ticks_mor_rt|false      |
 +--------+------------------+-----------+
 
@@ -374,11 +377,11 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. Lets look at both
-ReadOptimized and Realtime views supported by M-O-R dataset
+Lets run similar queries against M-O-R table. Lets look at both
+ReadOptimized and Snapshot queries supported by M-O-R table
 
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
 +------+-------------------+
 |symbol|max(ts)            |
 +------+-------------------+
@@ -386,7 +389,7 @@ scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HA
 +------+-------------------+
 
 
-# Run against Realtime View. Notice that the latest timestamp is again 10:29
+# Run Snapshot Query. Notice that the latest timestamp is again 10:29
 
 scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
 +------+-------------------+
@@ -395,9 +398,9 @@ scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol
 |GOOG  |2018-08-31 10:29:00|
 +------+-------------------+
 
-# Run projection query against Read Optimized and Realtime tables
+# Run Read Optimized and Snapshot project queries
 
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
 +-------------------+------+-------------------+------+---------+--------+
 |_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
 +-------------------+------+-------------------+------+---------+--------+
@@ -417,7 +420,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
 
 ### Step 4 (c): Run Presto Queries
 
-Here are the Presto queries for similar Hive and Spark queries. Currently, Hudi does not support Presto queries on realtime views.
+Here are the Presto queries for similar Hive and Spark queries. Currently, Presto does not support snapshot or incremental queries on Hudi tables.
 
 ```java
 docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
@@ -440,7 +443,7 @@ presto:default> show tables;
        Table
 --------------------
  stock_ticks_cow
- stock_ticks_mor
+ stock_ticks_mor_ro
  stock_ticks_mor_rt
 (3 rows)
 
@@ -478,10 +481,10 @@ Splits: 17 total, 17 done (100.00%)
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. 
+Lets run similar queries against M-O-R table. 
 
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-presto:default> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+    presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
  symbol |        _col1
 --------+---------------------
  GOOG   | 2018-08-31 10:29:00
@@ -492,7 +495,7 @@ Splits: 49 total, 49 done (100.00%)
 0:02 [197 rows, 613B] [110 rows/s, 343B/s]
 
 
-presto:default>  select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+presto:default>  select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
 ---------------------+--------+---------------------+--------+-----------+----------
  20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
@@ -517,12 +520,12 @@ cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
 # Within Docker container, run the ingestion command
 docker exec -it adhoc-2 /bin/bash
 
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow table in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --table-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
 
 
-# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor table in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --table-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
 
 exit
 ```
@@ -535,12 +538,12 @@ Take a look at the HDFS filesystem to get an idea: `http://namenode:50070/explor
 
 ### Step 6(a): Run Hive Queries
 
-With Copy-On-Write table, the read-optimized view immediately sees the changes as part of second batch once the batch
+With Copy-On-Write table, the Snapshot query immediately sees the changes as part of second batch once the batch
 got committed as each ingestion creates newer versions of parquet files.
 
 With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
-This is the time, when ReadOptimized and Realtime views will provide different results. ReadOptimized view will still
-return "10:29 am" as it will only read from the Parquet file. Realtime View will do on-the-fly merge and return
+This is the time, when ReadOptimized and Snapshot queries will provide different results. ReadOptimized query will still
+return "10:29 am" as it will only read from the Parquet file. Snapshot query will do on-the-fly merge and return
 latest committed data which is "10:59 a.m".
 
 ```java
@@ -571,8 +574,8 @@ As you can notice, the above queries now reflect the changes that came as part o
 
 # Merge On Read Table:
 
-# Read Optimized View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+# Read Optimized Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 +---------+----------------------+--+
 | symbol  |         _c1          |
@@ -581,7 +584,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the futu
 +---------+----------------------+--+
 1 row selected (1.6 seconds)
 
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 +----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
 +----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -589,7 +592,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the futu
 | 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
 +----------------------+---------+----------------------+---------+------------+-----------+--+
 
-# Realtime View
+# Snapshot Query
 0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 +---------+----------------------+--+
@@ -616,7 +619,7 @@ Running the same queries in Spark-SQL:
 
 ```java
 docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages org.apache.spark:spark-avro_2.11:2.4.4
 
 # Copy On Write Table:
 
@@ -641,8 +644,8 @@ As you can notice, the above queries now reflect the changes that came as part o
 
 # Merge On Read Table:
 
-# Read Optimized View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
+# Read Optimized Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
 +---------+----------------------+--+
 | symbol  |         _c1          |
 +---------+----------------------+--+
@@ -650,7 +653,7 @@ scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HA
 +---------+----------------------+--+
 1 row selected (1.6 seconds)
 
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
 +----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
 +----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -658,7 +661,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
 | 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
 +----------------------+---------+----------------------+---------+------------+-----------+--+
 
-# Realtime View
+# Snapshot Query
 scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
 +---------+----------------------+--+
 | symbol  |         _c1          |
@@ -680,7 +683,7 @@ exit
 
 ### Step 6(c): Run Presto Queries
 
-Running the same queries on Presto for ReadOptimized views. 
+Running the same queries on Presto for ReadOptimized queries. 
 
 
 ```java
@@ -716,8 +719,8 @@ As you can notice, the above queries now reflect the changes that came as part o
 
 # Merge On Read Table:
 
-# Read Optimized View
-presto:default> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+# Read Optimized Query
+presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
  symbol |        _col1
 --------+---------------------
  GOOG   | 2018-08-31 10:29:00
@@ -727,7 +730,7 @@ Query 20190822_181602_00009_segyw, FINISHED, 1 node
 Splits: 49 total, 49 done (100.00%)
 0:01 [197 rows, 613B] [139 rows/s, 435B/s]
 
-presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
 ---------------------+--------+---------------------+--------+-----------+----------
  20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
@@ -744,7 +747,7 @@ presto:default> exit
 
 ### Step 7 : Incremental Query for COPY-ON-WRITE Table
 
-With 2 batches of data ingested, lets showcase the support for incremental queries in Hudi Copy-On-Write datasets
+With 2 batches of data ingested, lets showcase the support for incremental queries in Hudi Copy-On-Write tables
 
 Lets take the same projection query example
 
@@ -800,15 +803,15 @@ Here is the incremental query :
 ### Incremental Query with Spark SQL:
 ```java
 docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages org.apache.spark:spark-avro_2.11:2.4.4
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
+   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
       /_/
 
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
+Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
 Type in expressions to have them evaluated.
 Type :help for more information.
 
@@ -816,7 +819,7 @@ scala> import org.apache.hudi.DataSourceReadOptions
 import org.apache.hudi.DataSourceReadOptions
 
 # In the below query, 20180925045257 is the first commit's timestamp
-scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
+scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
 SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
 SLF4J: Defaulting to no-operation (NOP) logger implementation
 SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
@@ -835,7 +838,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
 ```
 
 
-### Step 8: Schedule and Run Compaction for Merge-On-Read dataset
+### Step 8: Schedule and Run Compaction for Merge-On-Read table
 
 Lets schedule and run a compaction to create a new version of columnar  file so that read-optimized readers will see fresher data.
 Again, You can use Hudi CLI to manually schedule and run compaction
@@ -843,24 +846,31 @@ Again, You can use Hudi CLI to manually schedule and run compaction
 ```java
 docker exec -it adhoc-1 /bin/bash
 root@adhoc-1:/opt#   /var/hoodie/ws/hudi-cli/hudi-cli.sh
-============================================
-*                                          *
-*     _    _           _   _               *
-*    | |  | |         | | (_)              *
-*    | |__| |       __| |  -               *
-*    |  __  ||   | / _` | ||               *
-*    | |  | ||   || (_| | ||               *
-*    |_|  |_|\___/ \____/ ||               *
-*                                          *
-============================================
-
-Welcome to Hoodie CLI. Please type help if you are looking for help.
+...
+Table command getting loaded
+HoodieSplashScreen loaded
+===================================================================
+*         ___                          ___                        *
+*        /\__\          ___           /\  \           ___         *
+*       / /  /         /\__\         /  \  \         /\  \        *
+*      / /__/         / /  /        / /\ \  \        \ \  \       *
+*     /  \  \ ___    / /  /        / /  \ \__\       /  \__\      *
+*    / /\ \  /\__\  / /__/  ___   / /__/ \ |__|     / /\/__/      *
+*    \/  \ \/ /  /  \ \  \ /\__\  \ \  \ / /  /  /\/ /  /         *
+*         \  /  /    \ \  / /  /   \ \  / /  /   \  /__/          *
+*         / /  /      \ \/ /  /     \ \/ /  /     \ \__\          *
+*        / /  /        \  /  /       \  /  /       \/__/          *
+*        \/__/          \/__/         \/__/    Apache Hudi CLI    *
+*                                                                 *
+===================================================================
+
+Welcome to Apache Hudi CLI. Please type help if you are looking for help.
 hudi->connect --path /user/hive/warehouse/stock_ticks_mor
 18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
 18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
 18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, 
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
hdfs-default.xml, hdfs-site.xml], FileSystem: 
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root 
(auth:SIMPLE)]]]
-18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading dataset properties 
from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading table properties from 
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
 Metadata for table stock_ticks_mor loaded
 
 # Ensure no compactions are present
@@ -884,8 +894,8 @@ Compaction successfully completed for 20180924070031
 hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
 18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
 18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, 
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
hdfs-default.xml, hdfs-site.xml], FileSystem: 
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root 
(auth:SIMPLE)]]]
-18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading dataset properties 
from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from 
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
 Metadata for table stock_ticks_mor loaded
 
 
@@ -911,8 +921,8 @@ Compaction successfully completed for 20180924070031
 hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
 18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
 18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, 
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
hdfs-default.xml, hdfs-site.xml], FileSystem: 
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root 
(auth:SIMPLE)]]]
-18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading dataset properties 
from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading table properties from 
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
 Metadata for table stock_ticks_mor loaded
 
 
@@ -928,7 +938,7 @@ hoodie:stock_ticks->compactions show all
 
 ### Step 9: Run Hive Queries including incremental queries
 
-You will see that both ReadOptimized and Realtime Views will show the latest committed data.
+You will see that both ReadOptimized and Snapshot queries will show the latest committed data.
 Lets also run the incremental query for MOR table.
 From looking at the below query output, it will be clear that the fist commit 
time for the MOR table is 20180924064636
 and the second commit time is 20180924070031
@@ -937,8 +947,8 @@ and the second commit time is 20180924070031
 docker exec -it adhoc-2 /bin/bash
 beeline -u jdbc:hive2://hiveserver:10000 --hiveconf 
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf 
hive.stats.autogather=false
 
-# Read Optimized View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor 
group by symbol HAVING symbol = 'GOOG';
+# Read Optimized Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from 
stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
 +---------+----------------------+--+
 | symbol  |         _c1          |
@@ -947,7 +957,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be 
available in the futu
 +---------+----------------------+--+
 1 row selected (1.6 seconds)
 
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, 
volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, 
volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    
|   close   |
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -955,7 +965,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be 
available in the futu
 | 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  
| 1227.215  |
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 
-# Realtime View
+# Snapshot Query
 0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from 
stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
 +---------+----------------------+--+
@@ -972,7 +982,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be 
available in the futu
 | 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  
| 1227.215  |
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 
-# Incremental View:
+# Incremental Query:
 
 0: jdbc:hive2://hiveserver:10000> set 
hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
 No rows affected (0.008 seconds)
@@ -982,7 +992,7 @@ No rows affected (0.007 seconds)
 0: jdbc:hive2://hiveserver:10000> set 
hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
 No rows affected (0.013 seconds)
 # Query:
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, 
volume, open, close  from stock_ticks_mor where  symbol = 'GOOG' and 
`_hoodie_commit_time` > '20180924064636';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, 
volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG' and 
`_hoodie_commit_time` > '20180924064636';
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    
|   close   |
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -992,14 +1002,14 @@ exit
 exit
 ```
 
-### Step 10: Read Optimized and Realtime Views for MOR with Spark-SQL after 
compaction
+### Step 10: Read Optimized and Snapshot queries for MOR with Spark-SQL after 
compaction
 
 ```java
 docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages org.apache.spark:spark-avro_2.11:2.4.4
 
-# Read Optimized View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
 +---------+----------------------+--+
 | symbol  |         _c1          |
 +---------+----------------------+--+
@@ -1007,7 +1017,7 @@ scala> spark.sql("select symbol, max(ts) from 
stock_ticks_mor group by symbol HA
 +---------+----------------------+--+
 1 row selected (1.6 seconds)
 
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    
|   close   |
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -1015,7 +1025,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, 
ts, volume, open, close
 | 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  
| 1227.215  |
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 
-# Realtime View
+# Snapshot Query
 scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by 
symbol HAVING symbol = 'GOOG'").show(100, false)
 +---------+----------------------+--+
 | symbol  |         _c1          |
@@ -1032,15 +1042,15 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, 
ts, volume, open, close
 
+----------------------+---------+----------------------+---------+------------+-----------+--+
 ```
 
-### Step 11:  Presto queries over Read Optimized View on MOR dataset after 
compaction
+### Step 11:  Presto Read Optimized queries on MOR table after compaction
 
 ```java
 docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
 presto> use hive.default;
 USE
 
-# Read Optimized View
-resto:default> select symbol, max(ts) from stock_ticks_mor group by symbol 
HAVING symbol = 'GOOG';
+# Read Optimized Query
+resto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol 
HAVING symbol = 'GOOG';
   symbol |        _col1
 --------+---------------------
  GOOG   | 2018-08-31 10:59:00
@@ -1050,7 +1060,7 @@ Query 20190822_182319_00011_segyw, FINISHED, 1 node
 Splits: 49 total, 49 done (100.00%)
 0:01 [197 rows, 613B] [133 rows/s, 414B/s]
 
-presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  
from stock_ticks_mor where  symbol = 'GOOG';
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  
from stock_ticks_mor_ro where  symbol = 'GOOG';
  _hoodie_commit_time | symbol |         ts          | volume |   open    |  
close
 
---------------------+--------+---------------------+--------+-----------+----------
  20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  
1230.02
@@ -1076,7 +1086,7 @@ $ mvn pre-integration-test -DskipTests
 ```
 The above command builds docker images for all the services with
 current Hudi source installed at /var/hoodie/ws and also brings up the 
services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker 
images.
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker 
images.
 
 To bring down the containers
 ```java
diff --git a/docs/_docs/1_1_quick_start_guide.md 
b/docs/_docs/1_1_quick_start_guide.md
index 708bd2c..d7d645e 100644
--- a/docs/_docs/1_1_quick_start_guide.md
+++ b/docs/_docs/1_1_quick_start_guide.md
@@ -16,10 +16,20 @@ Hudi works with Spark-2.x versions. You can follow 
instructions [here](https://s
 From the extracted directory run spark-shell with Hudi as:
 
 ```scala
-bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
+spark-2.4.4-bin-hadoop2.7/bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
+<div class="notice--info">
+  <h4>Please note the following: </h4>
+<ul>
+  <li>spark-avro module needs to be specified in --packages as it is not included with spark-shell by default</li>
+  <li>spark-avro and spark versions must match (we have used 2.4.4 for both above)</li>
+  <li>we have used hudi-spark-bundle built for scala 2.11 since the spark-avro module used also depends on 2.11. 
+         If spark-avro_2.12 is used, correspondingly hudi-spark-bundle_2.12 needs to be used. </li>
+</ul>
+</div>
+
 Setup table name, base path and a data generator to generate records for this 
guide.
 
 ```scala
@@ -83,7 +93,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key, 
_hoodie_partition_pat
 
 This query provides snapshot querying of the ingested data. Since our 
partition path (`region/country/city`) is 3 levels nested 
 from base path we ve used `load(basePath + "/*/*/*/*")`. 
-Refer to [Table types and queries](/docs/concepts#table-types--queries) for 
more info on all table types and querying types supported.
+Refer to [Table types and queries](/docs/concepts#table-types--queries) for 
more info on all table types and query types supported.
 {: .notice--info}
 
 ## Update data
@@ -133,7 +143,7 @@ val tripsIncrementalDF = spark.
     option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     load(basePath);
-tripsIncrementalDF.registerTempTable("hudi_trips_incremental")
+tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
 spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  
hudi_trips_incremental where fare > 20.0").show()
 ``` 
 
@@ -156,7 +166,7 @@ val tripsPointInTimeDF = 
spark.read.format("org.apache.hudi").
     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     option(END_INSTANTTIME_OPT_KEY, endTime).
     load(basePath);
-tripsPointInTimeDF.registerTempTable("hudi_trips_point_in_time")
+tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
 spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  
hudi_trips_point_in_time where fare > 20.0").show()
 ```
 
@@ -196,8 +206,9 @@ Note: Only `Append` mode is supported for delete operation.
 ## Where to go from here?
 
 You can also do the quickstart by [building hudi yourself](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source), 
-and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar` in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating`
+and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` in the spark-shell command above
+instead of `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/incubator-hudi#build-with-scala-212)
+for more info.
 
 Also, we used Spark here to show case the capabilities of Hudi. However, Hudi 
can support multiple table types/query types and 
 Hudi tables can be queried from query engines like Hive, Spark, Presto and 
much more. We have put together a 
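
Note on the quick start change above: the new `--packages` value couples three versions. Both artifacts carry the same Scala binary suffix, and the spark-avro version equals the Spark version. The following is a minimal Java sketch of that rule, not code from Hudi; the class and method names (`HudiPackages`, `scalaBinaryVersion`, `packagesArg`) are hypothetical and exist only for illustration.

```java
// Illustrative helper, NOT part of Hudi or Spark: builds the value passed to
// spark-shell's --packages option from the versions named in the quick start.
public class HudiPackages {

    /** Reduces a full Scala version ("2.11.12") to its binary version ("2.11"). */
    public static String scalaBinaryVersion(String fullScalaVersion) {
        String[] parts = fullScalaVersion.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected Scala version: " + fullScalaVersion);
        }
        return parts[0] + "." + parts[1];
    }

    /**
     * Both artifacts get the same Scala suffix; spark-avro is pinned to the
     * Spark version, as the notice in the quick start requires.
     */
    public static String packagesArg(String scalaVersion, String sparkVersion, String hudiVersion) {
        String scala = scalaBinaryVersion(scalaVersion);
        String hudiBundle = "org.apache.hudi:hudi-spark-bundle_" + scala + ":" + hudiVersion;
        String sparkAvro = "org.apache.spark:spark-avro_" + scala + ":" + sparkVersion;
        return hudiBundle + "," + sparkAvro;
    }

    public static void main(String[] args) {
        // Versions taken from the diff: Scala 2.11.12, Spark 2.4.4, Hudi 0.5.1-incubating.
        System.out.println(packagesArg("2.11.12", "2.4.4", "0.5.1-incubating"));
        // prints org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
    }
}
```

Deriving both coordinates from one Scala version makes it impossible to mix a `_2.11` bundle with a `_2.12` spark-avro, which is the mismatch the notice warns about.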
diff --git a/docs/_docs/1_2_structure.md b/docs/_docs/1_2_structure.md
index e080fcd..ddcdb1a 100644
--- a/docs/_docs/1_2_structure.md
+++ b/docs/_docs/1_2_structure.md
@@ -6,7 +6,7 @@ summary: "Hudi brings stream processing to big data, providing 
fresh data while
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tables over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three types of querying.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tables over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three types of queries.
 
  * **Read Optimized query** - Provides excellent query performance on pure 
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
  * **Incremental query** - Provides a change stream out of the dataset to feed 
downstream jobs/ETLs.
diff --git a/docs/_docs/2_1_concepts.md b/docs/_docs/2_1_concepts.md
index c99aa41..cf61811 100644
--- a/docs/_docs/2_1_concepts.md
+++ b/docs/_docs/2_1_concepts.md
@@ -141,11 +141,11 @@ The intention of copy on write table, is to fundamentally 
improve how tables are
 
 Merge on read table is a superset of copy on write, in the sense it still 
supports read optimized queries of the table by exposing only the base/columnar 
files in latest file slices.
 Additionally, it stores incoming upserts for each file group, onto a row based 
delta log, to support snapshot queries by applying the delta log, 
-onto the latest version of each file id on-the-fly during query time. Thus, 
this table type attempts to balance read and write amplication intelligently, 
to provide near real-time data.
+onto the latest version of each file id on-the-fly during query time. Thus, 
this table type attempts to balance read and write amplification intelligently, 
to provide near real-time data.
 The most significant change here, would be to the compactor, which now 
carefully chooses which delta log files need to be compacted onto
 their columnar base file, to keep the query performance in check (larger delta 
log files would incur longer merge times with merge data on query side)
 
-Following illustrates how the table works, and shows two types of querying - 
snapshot querying and read optimized querying.
+Following illustrates how the table works, and shows two types of queries - 
snapshot query and read optimized query.
 
 <figure>
     <img class="docimage" src="/assets/images/hudi_mor.png" alt="hudi_mor.png" 
style="max-width: 100%" />
@@ -158,7 +158,7 @@ There are lot of interesting things happening in this 
example, which bring out t
  all the data from 10:05 to 10:10. The base columnar files are still versioned 
with the commit, as before.
  Thus, if one were to simply look at base files alone, then the table layout 
looks exactly like a copy on write table.
  - A periodic compaction process reconciles these changes from the delta log 
and produces a new version of base file, just like what happened at 10:05 in 
the example.
- - There are two ways of querying the same underlying table: Read Optimized 
querying and Snapshot querying, depending on whether we chose query performance 
or freshness of data.
+ - There are two ways of querying the same underlying table: Read Optimized 
query and Snapshot query, depending on whether we chose query performance or 
freshness of data.
  - The semantics around when data from a commit is available to a query 
changes in a subtle way for a read optimized query. Note, that such a query
  running at 10:10, wont see data after 10:05 above, while a snapshot query 
always sees the freshest data.
  - When we trigger compaction & what it decides to compact hold all the key to 
solving these hard problems. By implementing a compacting
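
Note on the concepts text above: the difference between a read optimized query and a snapshot query on a merge on read table can be pictured with a toy model of a single file group. This is only a sketch under simplifying assumptions (one in-memory map as the base file, one list as the delta log); `MorQuerySketch` and its methods are hypothetical names and do not correspond to Hudi classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of one merge-on-read file group; NOT Hudi's implementation.
public class MorQuerySketch {
    // Records as of the last compaction (stands in for the columnar base file).
    private final Map<String, String> baseFile = new TreeMap<>();
    // Upserts received since the last compaction (stands in for the delta log).
    private final List<String[]> deltaLog = new ArrayList<>();

    /** Writers append upserts to the delta log instead of rewriting the base file. */
    public void upsert(String key, String value) {
        deltaLog.add(new String[] {key, value});
    }

    /** Read optimized query: sees only the base file. */
    public Map<String, String> readOptimized() {
        return new TreeMap<>(baseFile);
    }

    /** Snapshot query: applies the delta log over the base file at query time. */
    public Map<String, String> snapshot() {
        Map<String, String> merged = new TreeMap<>(baseFile);
        for (String[] entry : deltaLog) {
            merged.put(entry[0], entry[1]);
        }
        return merged;
    }

    /** Compaction: folds the delta log into a new version of the base file. */
    public void compact() {
        Map<String, String> merged = snapshot();
        baseFile.clear();
        baseFile.putAll(merged);
        deltaLog.clear();
    }

    public static void main(String[] args) {
        MorQuerySketch table = new MorQuerySketch();
        table.upsert("GOOG", "10:05");
        table.compact();                           // compaction at 10:05
        table.upsert("GOOG", "10:10");             // newer data lands in the delta log

        System.out.println(table.readOptimized()); // prints {GOOG=10:05}
        System.out.println(table.snapshot());      // prints {GOOG=10:10}

        table.compact();
        System.out.println(table.readOptimized()); // prints {GOOG=10:10}
    }
}
```

The sketch mirrors the trade-off the text describes: the read optimized path does no merge work but lags until the next compaction, while the snapshot path pays a merge cost on every read to return the freshest data.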
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index b407111..52ba503 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -43,23 +43,56 @@ Command line options describe capabilities in more detail
 ```java
 [hoodie]$ spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls 
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
 Usage: <main class> [options]
-  Options:
+Options:
+    --checkpoint
+      Resume Delta Streamer from this checkpoint.
     --commit-on-errors
-        Commit even when some records failed to be written
+      Commit even when some records failed to be written
+      Default: false
+    --compact-scheduling-minshare
+      Minshare for compaction as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 0
+    --compact-scheduling-weight
+      Scheduling weight for compaction as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 1
+    --continuous
+      Delta Streamer runs in continuous mode running source-fetch -> Transform
+      -> Hudi Write in loop
+      Default: false
+    --delta-sync-scheduling-minshare
+      Minshare for delta sync as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 0
+    --delta-sync-scheduling-weight
+      Scheduling weight for delta sync as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 1
+    --disable-compaction
+      Compaction is enabled for MoR table by default. This flag disables it
       Default: false
     --enable-hive-sync
-          Enable syncing to hive
-       Default: false
+      Enable syncing to hive
+      Default: false
     --filter-dupes
-          Should duplicate records from source be dropped/filtered outbefore 
-          insert/bulk-insert 
+      Should duplicate records from source be dropped/filtered out before
+      insert/bulk-insert
       Default: false
     --help, -h
-    --hudi-conf
-          Any configuration that can be set in the properties file (using the 
CLI 
-          parameter "--propsFilePath") can also be passed command line using 
this 
-          parameter 
-          Default: []
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--propsFilePath") can also be passed command line using this
+      parameter
+      Default: []
+    --max-pending-compactions
+      Maximum number of outstanding inflight/requested compactions. Delta Sync
+      will not happen unlessoutstanding compactions is less than this number
+      Default: 5
+    --min-sync-interval-seconds
+      the min sync interval of each sync in continuous mode
+      Default: 0
     --op
       Takes one of these values : UPSERT (default), INSERT (use when input is
       purely new data/inserts to gain speed)
@@ -69,19 +102,22 @@ Usage: <main class> [options]
       subclass of HoodieRecordPayload, that works off a GenericRecord.
       Implement your own, if you want to do something other than overwriting
       existing value
-      Default: org.apache.hudi.OverwriteWithLatestAvroPayload
+      Default: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
     --props
       path to properties file on localfs or dfs, with configurations for
-      Hudi client, schema provider, key generator and data source. For
-      Hudi client props, sane defaults are used, but recommend use to
+      hoodie client, schema provider, key generator and data source. For
+      hoodie client props, sane defaults are used, but recommend use to
       provide basic things like metrics endpoints, hive configs etc. For
       sources, referto individual classes, for supported properties.
       Default: 
file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
     --schemaprovider-class
       subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
       schemas to input & target table data, built in options:
-      FilebasedSchemaProvider
-      Default: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+      org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source (See
+      org.apache.hudi.utilities.sources.Source) implementation can implement
+      their own SchemaProvider. For Sources that return Dataset<Row>, the
+      schema is obtained implicitly. However, this CLI option allows
+      overriding the schemaprovider returned by Source.
     --source-class
       Subclass of org.apache.hudi.utilities.sources to read data. Built-in
       options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
@@ -89,7 +125,7 @@ Usage: <main class> [options]
       Default: org.apache.hudi.utilities.sources.JsonDFSSource
     --source-limit
       Maximum amount of data to read from source. Default: No limit For e.g:
-      DFSSource => max bytes to read, KafkaSource => max events to read
+      DFS-Source => max bytes to read, Kafka-Source => max events to read
       Default: 9223372036854775807
     --source-ordering-field
       Field within source record to decide how to break ties between records
@@ -99,17 +135,19 @@ Usage: <main class> [options]
     --spark-master
       spark master to use.
       Default: local[2]
+  * --table-type
+      Type of table. COPY_ON_WRITE (or) MERGE_ON_READ
   * --target-base-path
-      base path for the target Hudi table. (Will be created if did not
-      exist first time around. If exists, expected to be a Hudi table)
+      base path for the target hoodie table. (Will be created if did not exist
+      first time around. If exists, expected to be a hoodie table)
   * --target-table
       name of the target table in Hive
     --transformer-class
-      subclass of org.apache.hudi.utilities.transform.Transformer. UDF to
-      transform raw source dataset to a target dataset (conforming to target
-      schema) before writing. Default : Not set. E:g -
+      subclass of org.apache.hudi.utilities.transform.Transformer. Allows
+      transforming raw source Dataset to a target Dataset (conforming to
+      target schema) before writing. Default : Not set. E:g -
       org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
-      allows a SQL query template to be passed as a transformation function)
+      allows a SQL query templated to be passed as a transformation function)
 ```
 
 The tool takes a hierarchically composed property file and has pluggable 
interfaces for extracting data, key generation and providing schema. Sample 
configs for ingesting from kafka and dfs are
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 8c6d357..2d97e2b 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -15,12 +15,12 @@ Specifically, following Hive tables are registered based 
off [table name](/docs/
 and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during 
write.   
 
 If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
- - `hudi_trips` supports snapshot querying and incremental querying of the 
table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+ - `hudi_trips` supports snapshot query and incremental query on the table 
backed by `HoodieParquetInputFormat`, exposing purely columnar data.
 
 
 If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
- - `hudi_trips_rt` supports snapshot querying and incremental querying 
(providing near-real time data) of the table  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
- - `hudi_trips_ro` supports read optimized querying of the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
+ - `hudi_trips_rt` supports snapshot query and incremental query (providing 
near-real time data) on the table  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized query on the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
  
 
 As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
@@ -89,11 +89,11 @@ separated) and calls InputFormat.listStatus() only once 
with all those partition
 Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
 
  - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot querying, relying on the custom Hudi input formats again like Hive.
+ - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
  
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+ In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
  
-### Read optimized querying
+### Read optimized query
 
 Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
 This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
@@ -110,12 +110,12 @@ Dataset<Row> hoodieROViewDF = 
spark.read().format("org.apache.hudi")
 .load("/glob/path/pattern");
 ```
  
-### Snapshot querying {#spark-snapshot-querying}
-Currently, near-real time data can only be queried as a Hive table in Spark 
using snapshot querying mode. In order to do this, set 
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback 
+### Snapshot query {#spark-snapshot-query}
+Currently, near-real time data can only be queried as a Hive table in Spark 
using snapshot query mode. In order to do this, set 
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback 
 to using the Hive Serde to read the data (planning/executions is still Spark). 
 
 ```java
-$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path 
/etc/hive/conf  --packages org.apache.spark:spark-avro_2.11:2.4.4 --conf 
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 
7g --executor-memory 2g  --master yarn-client
+$ spark-shell --jars hudi-spark-bundle_2.11-x.y.z-SNAPSHOT.jar 
--driver-class-path /etc/hive/conf  --packages 
org.apache.spark:spark-avro_2.11:2.4.4 --conf 
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 
7g --executor-memory 2g  --master yarn-client
 
 scala> sqlContext.sql("select count(*) from hudi_trips_rt where datestr = 
'2016-10-02'").show()
 scala> sqlContext.sql("select count(*) from hudi_trips_rt where datestr = 
'2016-10-02'").show()
@@ -148,5 +148,5 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. 
Presto currently supports only read optimized querying on Hudi tables. 
+Presto is a popular query engine, providing interactive query performance. 
Presto currently supports only read optimized queries on Hudi tables. 
 This requires the `hudi-presto-bundle` jar to be placed into 
`<presto_install>/plugin/hive-hadoop2/`, across the installation.
diff --git a/docs/_docs/2_5_performance.md b/docs/_docs/2_5_performance.md
index 6f489fc..580d180 100644
--- a/docs/_docs/2_5_performance.md
+++ b/docs/_docs/2_5_performance.md
@@ -41,9 +41,9 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% 
inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a 
challenging workload like an '100% update' database ingestion workload spanning 
 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers 
a **80-100% speedup**.
 
-## Read Optimized Queries
+## Snapshot Queries
 
-The major design goal for read optimized querying is to achieve the latency 
reduction & efficiency gains in previous section,
+The major design goal for snapshot queries is to achieve the latency reduction 
& efficiency gains in previous section,
 with no impact on queries. Following charts compare the Hudi vs non-Hudi 
tables across Hive/Presto/Spark queries and demonstrate this.
 
 **Hive**
