This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ace36af [HUDI-577] update docker demo page and quick start pages
(#1279)
ace36af is described below
commit ace36afff30e940f764e2ad8ddbfd194086a3bde
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Thu Jan 30 21:46:25 2020 -0800
[HUDI-577] update docker demo page and quick start pages (#1279)
Summary:
- contains changes that reflect renaming of terminologies to be in sync wth
CWiki
- contains doc changes pertaining to support of multiple scala versions
---
docs/_docs/0_4_docker_demo.md | 236 +++++++++++++++++++-----------------
docs/_docs/1_1_quick_start_guide.md | 23 +++-
docs/_docs/1_2_structure.md | 2 +-
docs/_docs/2_1_concepts.md | 6 +-
docs/_docs/2_2_writing_data.md | 84 +++++++++----
docs/_docs/2_3_querying_data.md | 20 +--
docs/_docs/2_5_performance.md | 4 +-
7 files changed, 217 insertions(+), 158 deletions(-)
diff --git a/docs/_docs/0_4_docker_demo.md b/docs/_docs/0_4_docker_demo.md
index 87c0716..3033371 100644
--- a/docs/_docs/0_4_docker_demo.md
+++ b/docs/_docs/0_4_docker_demo.md
@@ -40,7 +40,7 @@ Also, this has not been tested on some environments like
Docker on Windows.
### Build Hudi
-The first step is to build hudi
+The first step is to build hudi. **Note** This step builds hudi on default
supported scala version - 2.11.
```java
cd <HUDI_WORKSPACE>
mvn package -DskipTests
@@ -63,7 +63,10 @@ Stopping hivemetastore ... done
Stopping historyserver ... done
.......
......
-Creating network "hudi_demo" with the default driver
+Creating network "compose_default" with the default driver
+Creating volume "compose_namenode" with default driver
+Creating volume "compose_historyserver" with default driver
+Creating volume "compose_hive-metastore-postgresql" with default driver
Creating hive-metastore-postgresql ... done
Creating namenode ... done
Creating zookeeper ... done
@@ -94,12 +97,12 @@ At this point, the docker cluster will be up and running.
The demo cluster bring
## Demo
-Stock Tracker data will be used to showcase both different Hudi Views and the
effects of Compaction.
+Stock Tracker data will be used to showcase different Hudi query types and the
effects of Compaction.
Take a look at the directory `docker/demo/data`. There are 2 batches of stock
data - each at 1 minute granularity.
The first batch contains stocker tracker data for some stock symbols during
the first hour of trading window
(9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30
mins (10:30 - 11 a.m). Hudi will
-be used to ingest these batches to a dataset which will contain the latest
stock tracker data at hour level granularity.
+be used to ingest these batches to a table which will contain the latest stock
tracker data at hour level granularity.
The batches are windowed intentionally so that the second batch contains
updates to some of the rows in the first batch.
### Step 1 : Publish the first batch to Kafka
@@ -151,19 +154,19 @@ kafkacat -b kafkabroker -L -J | jq .
### Step 2: Incrementally ingest data from Kafka topic
Hudi comes with a tool named DeltaStreamer. This tool can connect to variety
of data sources (including Kafka) to
-pull changes and apply to Hudi dataset using upsert/insert primitives. Here,
we will use the tool to download
+pull changes and apply to Hudi table using upsert/insert primitives. Here, we
will use the tool to download
json data from kafka topic and ingest to both COW and MOR tables we
initialized in the previous step. This tool
-automatically initializes the datasets in the file-system if they do not exist
yet.
+automatically initializes the tables in the file-system if they do not exist
yet.
```java
docker exec -it adhoc-2 /bin/bash
-# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow table in HDFS
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --table-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
-# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
+# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor table in HDFS
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --table-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
# As part of the setup (Look at setup_demo.sh), the configs needed for
DeltaStreamer is uploaded to HDFS. The configs
@@ -172,50 +175,50 @@ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
exit
```
-You can use HDFS web-browser to look at the datasets
+You can use HDFS web-browser to look at the tables
`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
-You can explore the new partition folder created in the dataset along with a
"deltacommit"
+You can explore the new partition folder created in the table along with a
"deltacommit"
file under .hoodie which signals a successful commit.
-There will be a similar setup when you browse the MOR dataset
+There will be a similar setup when you browse the MOR table
`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor`
### Step 3: Sync with Hive
-At this step, the datasets are available in HDFS. We need to sync with Hive to
create new Hive tables and add partitions
-inorder to run Hive queries against those datasets.
+At this step, the tables are available in HDFS. We need to sync with Hive to
create new Hive tables and add partitions
+inorder to run Hive queries against those tables.
```java
docker exec -it adhoc-2 /bin/bash
-# THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS
and sync the HDFS state to Hive
+# THis command takes in HIveServer URL and COW Hudi table location in HDFS and
sync the HDFS state to Hive
/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table
stock_ticks_cow
.....
-2018-09-24 22:22:45,568 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
+2020-01-25 19:51:28,953 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_cow
.....
-# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR
storage)
+# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR
table type)
/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_mor --database default --table
stock_ticks_mor
...
-2018-09-24 22:23:09,171 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
+2020-01-25 19:51:51,066 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_ro
...
-2018-09-24 22:23:09,559 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor_rt
+2020-01-25 19:51:51,569 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_rt
....
exit
```
After executing the above command, you will notice
-1. A hive table named `stock_ticks_cow` created which provides Read-Optimized
view for the Copy On Write dataset.
-2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the
Merge On Read dataset. The former
-provides the ReadOptimized view for the Hudi dataset and the later provides
the realtime-view for the dataset.
+1. A hive table named `stock_ticks_cow` created which supports Snapshot and
Incremental queries on Copy On Write table.
+2. Two new tables `stock_ticks_mor_rt` and `stock_ticks_mor_ro` created for
the Merge On Read table. The former
+supports Snapshot and Incremental queries (providing near-real time data)
while the later supports ReadOptimized queries.
### Step 4 (a): Run Hive Queries
-Run a hive query to find the latest timestamp ingested for stock symbol
'GOOG'. You will notice that both read-optimized
-(for both COW and MOR dataset)and realtime views (for MOR dataset)give the
same value "10:29 a.m" as Hudi create a
+Run a hive query to find the latest timestamp ingested for stock symbol
'GOOG'. You will notice that both snapshot
+(for both COW and MOR _rt table) and read-optimized queries (for MOR _ro
table) give the same value "10:29 a.m" as Hudi create a
parquet file for the first batch of data.
```java
@@ -227,10 +230,10 @@ beeline -u jdbc:hive2://hiveserver:10000 --hiveconf
hive.input.format=org.apache
| tab_name |
+---------------------+--+
| stock_ticks_cow |
-| stock_ticks_mor |
+| stock_ticks_mor_ro |
| stock_ticks_mor_rt |
+---------------------+--+
-2 rows selected (0.801 seconds)
+3 rows selected (1.199 seconds)
0: jdbc:hive2://hiveserver:10000>
@@ -269,11 +272,11 @@ Now, run a projection query:
# Merge-On-Read Queries:
==========================
-Lets run similar queries against M-O-R dataset. Lets look at both
-ReadOptimized and Realtime views supported by M-O-R dataset
+Lets run similar queries against M-O-R table. Lets look at both
+ReadOptimized and Snapshot(realtime data) queries supported by M-O-R table
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor
group by symbol HAVING symbol = 'GOOG';
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from
stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
@@ -283,7 +286,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
1 row selected (6.326 seconds)
-# Run against Realtime View. Notice that the latest timestamp is again 10:29
+# Run Snapshot Query. Notice that the latest timestamp is again 10:29
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from
stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
@@ -295,9 +298,9 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
1 row selected (1.606 seconds)
-# Run projection query against Read Optimized and Realtime tables
+# Run Read Optimized and Snapshot project queries
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor where symbol = 'GOOG';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open
| close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -323,17 +326,17 @@ running in spark-sql
```java
docker exec -it adhoc-1 /bin/bash
-$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2]
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --executor-memory 3G --num-executors 1 --packages
com.databricks:spark-avro_2.11:4.0.0
+$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2]
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --executor-memory 3G --num-executors 1 --packages
org.apache.spark:spark-avro_2.11:2.4.4
...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
- /___/ .__/\_,_/_/ /_/\_\ version 2.3.1
+ /___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
+Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
@@ -343,7 +346,7 @@ scala> spark.sql("show tables").show(100, false)
|database|tableName |isTemporary|
+--------+------------------+-----------+
|default |stock_ticks_cow |false |
-|default |stock_ticks_mor |false |
+|default |stock_ticks_mor_ro|false |
|default |stock_ticks_mor_rt|false |
+--------+------------------+-----------+
@@ -374,11 +377,11 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol,
ts, volume, open, close
# Merge-On-Read Queries:
==========================
-Lets run similar queries against M-O-R dataset. Lets look at both
-ReadOptimized and Realtime views supported by M-O-R dataset
+Lets run similar queries against M-O-R table. Lets look at both
+ReadOptimized and Snapshot queries supported by M-O-R table
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG'").show(100, false)
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by
symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts) |
+------+-------------------+
@@ -386,7 +389,7 @@ scala> spark.sql("select symbol, max(ts) from
stock_ticks_mor group by symbol HA
+------+-------------------+
-# Run against Realtime View. Notice that the latest timestamp is again 10:29
+# Run Snapshot Query. Notice that the latest timestamp is again 10:29
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by
symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
@@ -395,9 +398,9 @@ scala> spark.sql("select symbol, max(ts) from
stock_ticks_mor_rt group by symbol
|GOOG |2018-08-31 10:29:00|
+------+-------------------+
-# Run projection query against Read Optimized and Realtime tables
+# Run Read Optimized and Snapshot project queries
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open,
close from stock_ticks_mor where symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open,
close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts |volume|open |close |
+-------------------+------+-------------------+------+---------+--------+
@@ -417,7 +420,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts,
volume, open, close
### Step 4 (c): Run Presto Queries
-Here are the Presto queries for similar Hive and Spark queries. Currently,
Hudi does not support Presto queries on realtime views.
+Here are the Presto queries for similar Hive and Spark queries. Currently,
Presto does not support snapshot or incremental queries on Hudi tables.
```java
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
@@ -440,7 +443,7 @@ presto:default> show tables;
Table
--------------------
stock_ticks_cow
- stock_ticks_mor
+ stock_ticks_mor_ro
stock_ticks_mor_rt
(3 rows)
@@ -478,10 +481,10 @@ Splits: 17 total, 17 done (100.00%)
# Merge-On-Read Queries:
==========================
-Lets run similar queries against M-O-R dataset.
+Lets run similar queries against M-O-R table.
-# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
-presto:default> select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG';
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+ presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by
symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
@@ -492,7 +495,7 @@ Splits: 49 total, 49 done (100.00%)
0:02 [197 rows, 613B] [110 rows/s, 343B/s]
-presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close
from stock_ticks_mor where symbol = 'GOOG';
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close
from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open |
close
---------------------+--------+---------------------+--------+-----------+----------
20190822180250 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 |
1230.02
@@ -517,12 +520,12 @@ cat docker/demo/data/batch_2.json | kafkacat -b
kafkabroker -t stock_ticks -P
# Within Docker container, run the ingestion command
docker exec -it adhoc-2 /bin/bash
-# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow table in HDFS
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --table-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
-# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
+# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor table in HDFS
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --table-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
exit
```
@@ -535,12 +538,12 @@ Take a look at the HDFS filesystem to get an idea:
`http://namenode:50070/explor
### Step 6(a): Run Hive Queries
-With Copy-On-Write table, the read-optimized view immediately sees the changes
as part of second batch once the batch
+With Copy-On-Write table, the Snapshot query immediately sees the changes as
part of second batch once the batch
got committed as each ingestion creates newer versions of parquet files.
With Merge-On-Read table, the second ingestion merely appended the batch to an
unmerged delta (log) file.
-This is the time, when ReadOptimized and Realtime views will provide different
results. ReadOptimized view will still
-return "10:29 am" as it will only read from the Parquet file. Realtime View
will do on-the-fly merge and return
+This is the time, when ReadOptimized and Snapshot queries will provide
different results. ReadOptimized query will still
+return "10:29 am" as it will only read from the Parquet file. Snapshot query
will do on-the-fly merge and return
latest committed data which is "10:59 a.m".
```java
@@ -571,8 +574,8 @@ As you can notice, the above queries now reflect the
changes that came as part o
# Merge On Read Table:
-# Read Optimized View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor
group by symbol HAVING symbol = 'GOOG';
+# Read Optimized Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from
stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
@@ -581,7 +584,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
+---------+----------------------+--+
1 row selected (1.6 seconds)
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor where symbol = 'GOOG';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open
| close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -589,7 +592,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
| 20180924222155 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899
| 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
-# Realtime View
+# Snapshot Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from
stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
+---------+----------------------+--+
@@ -616,7 +619,7 @@ Running the same queries in Spark-SQL:
```java
docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1
--packages com.databricks:spark-avro_2.11:4.0.0
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1
--packages org.apache.spark:spark-avro_2.11:2.4.4
# Copy On Write Table:
@@ -641,8 +644,8 @@ As you can notice, the above queries now reflect the
changes that came as part o
# Merge On Read Table:
-# Read Optimized View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG'").show(100, false)
+# Read Optimized Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by
symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
@@ -650,7 +653,7 @@ scala> spark.sql("select symbol, max(ts) from
stock_ticks_mor group by symbol HA
+---------+----------------------+--+
1 row selected (1.6 seconds)
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open,
close from stock_ticks_mor where symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open,
close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open
| close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -658,7 +661,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts,
volume, open, close
| 20180924222155 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899
| 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
-# Realtime View
+# Snapshot Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by
symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+--+
| symbol | _c1 |
@@ -680,7 +683,7 @@ exit
### Step 6(c): Run Presto Queries
-Running the same queries on Presto for ReadOptimized views.
+Running the same queries on Presto for ReadOptimized queries.
```java
@@ -716,8 +719,8 @@ As you can notice, the above queries now reflect the
changes that came as part o
# Merge On Read Table:
-# Read Optimized View
-presto:default> select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG';
+# Read Optimized Query
+presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol
HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
@@ -727,7 +730,7 @@ Query 20190822_181602_00009_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:01 [197 rows, 613B] [139 rows/s, 435B/s]
-presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close
from stock_ticks_mor where symbol = 'GOOG';
+presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close
from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open |
close
---------------------+--------+---------------------+--------+-----------+----------
20190822180250 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 |
1230.02
@@ -744,7 +747,7 @@ presto:default> exit
### Step 7 : Incremental Query for COPY-ON-WRITE Table
-With 2 batches of data ingested, lets showcase the support for incremental
queries in Hudi Copy-On-Write datasets
+With 2 batches of data ingested, lets showcase the support for incremental
queries in Hudi Copy-On-Write tables
Lets take the same projection query example
@@ -800,15 +803,15 @@ Here is the incremental query :
### Incremental Query with Spark SQL:
```java
docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1
--packages com.databricks:spark-avro_2.11:4.0.0
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1
--packages org.apache.spark:spark-avro_2.11:2.4.4
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
- /___/ .__/\_,_/_/ /_/\_\ version 2.3.1
+ /___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
+Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
@@ -816,7 +819,7 @@ scala> import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceReadOptions
# In the below query, 20180925045257 is the first commit's timestamp
-scala> val hoodieIncViewDF =
spark.read.format("org.apache.hudi").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY,
"20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
+scala> val hoodieIncViewDF =
spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY,
"20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
@@ -835,7 +838,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts,
volume, open, close
```
-### Step 8: Schedule and Run Compaction for Merge-On-Read dataset
+### Step 8: Schedule and Run Compaction for Merge-On-Read table
Lets schedule and run a compaction to create a new version of columnar file
so that read-optimized readers will see fresher data.
Again, You can use Hudi CLI to manually schedule and run compaction
@@ -843,24 +846,31 @@ Again, You can use Hudi CLI to manually schedule and run
compaction
```java
docker exec -it adhoc-1 /bin/bash
root@adhoc-1:/opt# /var/hoodie/ws/hudi-cli/hudi-cli.sh
-============================================
-* *
-* _ _ _ _ *
-* | | | | | | (_) *
-* | |__| | __| | - *
-* | __ || | / _` | || *
-* | | | || || (_| | || *
-* |_| |_|\___/ \____/ || *
-* *
-============================================
-
-Welcome to Hoodie CLI. Please type help if you are looking for help.
+...
+Table command getting loaded
+HoodieSplashScreen loaded
+===================================================================
+* ___ ___ *
+* /\__\ ___ /\ \ ___ *
+* / / / /\__\ / \ \ /\ \ *
+* / /__/ / / / / /\ \ \ \ \ \ *
+* / \ \ ___ / / / / / \ \__\ / \__\ *
+* / /\ \ /\__\ / /__/ ___ / /__/ \ |__| / /\/__/ *
+* \/ \ \/ / / \ \ \ /\__\ \ \ \ / / / /\/ / / *
+* \ / / \ \ / / / \ \ / / / \ /__/ *
+* / / / \ \/ / / \ \/ / / \ \__\ *
+* / / / \ / / \ / / \/__/ *
+* \/__/ \/__/ \/__/ Apache Hudi CLI *
+* *
+===================================================================
+
+Welcome to Apache Hudi CLI. Please type help if you are looking for help.
hudi->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml,
hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root
(auth:SIMPLE)]]]
-18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading dataset properties
from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading table properties from
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
# Ensure no compactions are present
@@ -884,8 +894,8 @@ Compaction successfully completed for 20180924070031
hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml,
hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root
(auth:SIMPLE)]]]
-18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading dataset properties
from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
@@ -911,8 +921,8 @@ Compaction successfully completed for 20180924070031
hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml,
hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root
(auth:SIMPLE)]]]
-18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading dataset properties
from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
-18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading table properties from
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
@@ -928,7 +938,7 @@ hoodie:stock_ticks->compactions show all
### Step 9: Run Hive Queries including incremental queries
-You will see that both ReadOptimized and Realtime Views will show the latest
committed data.
+You will see that both ReadOptimized and Snapshot queries will show the latest
committed data.
Lets also run the incremental query for MOR table.
From looking at the below query output, it will be clear that the fist commit
time for the MOR table is 20180924064636
and the second commit time is 20180924070031
@@ -937,8 +947,8 @@ and the second commit time is 20180924070031
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 --hiveconf
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf
hive.stats.autogather=false
-# Read Optimized View
-0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor
group by symbol HAVING symbol = 'GOOG';
+# Read Optimized Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from
stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
@@ -947,7 +957,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
+---------+----------------------+--+
1 row selected (1.6 seconds)
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor where symbol = 'GOOG';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open
| close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -955,7 +965,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993
| 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
-# Realtime View
+# Snapshot Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from
stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark, tez)
or using Hive 1.X releases.
+---------+----------------------+--+
@@ -972,7 +982,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993
| 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
-# Incremental View:
+# Incremental Query:
0: jdbc:hive2://hiveserver:10000> set
hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
No rows affected (0.008 seconds)
@@ -982,7 +992,7 @@ No rows affected (0.007 seconds)
0: jdbc:hive2://hiveserver:10000> set
hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
No rows affected (0.013 seconds)
# Query:
-0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor where symbol = 'GOOG' and
`_hoodie_commit_time` > '20180924064636';
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG' and
`_hoodie_commit_time` > '20180924064636';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open
| close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -992,14 +1002,14 @@ exit
exit
```
-### Step 10: Read Optimized and Realtime Views for MOR with Spark-SQL after
compaction
+### Step 10: Read Optimized and Snapshot queries for MOR with Spark-SQL after
compaction
```java
docker exec -it adhoc-1 /bin/bash
-bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1
--packages com.databricks:spark-avro_2.11:4.0.0
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --deploy-mode client
--driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1
--packages org.apache.spark:spark-avro_2.11:2.4.4
-# Read Optimized View
-scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG'").show(100, false)
+# Read Optimized Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by
symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
@@ -1007,7 +1017,7 @@ scala> spark.sql("select symbol, max(ts) from
stock_ticks_mor group by symbol HA
+---------+----------------------+--+
1 row selected (1.6 seconds)
-scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open,
close from stock_ticks_mor where symbol = 'GOOG'").show(100, false)
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open,
close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open
| close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
@@ -1015,7 +1025,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol,
ts, volume, open, close
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993
| 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
-# Realtime View
+# Snapshot Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by
symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+--+
| symbol | _c1 |
@@ -1032,15 +1042,15 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol,
ts, volume, open, close
+----------------------+---------+----------------------+---------+------------+-----------+--+
```
-### Step 11: Presto queries over Read Optimized View on MOR dataset after
compaction
+### Step 11: Presto Read Optimized queries on MOR table after compaction
```java
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> use hive.default;
USE
-# Read Optimized View
-resto:default> select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG';
+# Read Optimized Query
+resto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol
HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:59:00
@@ -1050,7 +1060,7 @@ Query 20190822_182319_00011_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:01 [197 rows, 613B] [133 rows/s, 414B/s]
-presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close
from stock_ticks_mor where symbol = 'GOOG';
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close
from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open |
close
---------------------+--------+---------------------+--------+-----------+----------
20190822180250 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 |
1230.02
@@ -1076,7 +1086,7 @@ $ mvn pre-integration-test -DskipTests
```
The above command builds docker images for all the services with
current Hudi source installed at /var/hoodie/ws and also brings up the
services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker
images.
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker
images.
To bring down the containers
```java
diff --git a/docs/_docs/1_1_quick_start_guide.md
b/docs/_docs/1_1_quick_start_guide.md
index 708bd2c..d7d645e 100644
--- a/docs/_docs/1_1_quick_start_guide.md
+++ b/docs/_docs/1_1_quick_start_guide.md
@@ -16,10 +16,20 @@ Hudi works with Spark-2.x versions. You can follow
instructions [here](https://s
From the extracted directory run spark-shell with Hudi as:
```scala
-bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
+spark-2.4.4-bin-hadoop2.7/bin/spark-shell --packages
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
\
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
+<div class="notice--info">
+ <h4>Please note the following: </h4>
+<ul>
+ <li>spark-avro module needs to be specified in --packages as it is not
included with spark-shell by default</li>
+ <li>spark-avro and spark versions must match (we have used 2.4.4 for both
above)</li>
+ <li>we have used hudi-spark-bundle built for scala 2.11 since the spark-avro
module used also depends on 2.11.
+ If spark-avro_2.12 is used, correspondingly hudi-spark-bundle_2.12
needs to be used. </li>
+</ul>
+</div>
+
Setup table name, base path and a data generator to generate records for this
guide.
```scala
@@ -83,7 +93,7 @@ spark.sql("select _hoodie_commit_time, _hoodie_record_key,
_hoodie_partition_pat
This query provides snapshot querying of the ingested data. Since our
partition path (`region/country/city`) is 3 levels nested
from base path we ve used `load(basePath + "/*/*/*/*")`.
-Refer to [Table types and queries](/docs/concepts#table-types--queries) for
more info on all table types and querying types supported.
+Refer to [Table types and queries](/docs/concepts#table-types--queries) for
more info on all table types and query types supported.
{: .notice--info}
## Update data
@@ -133,7 +143,7 @@ val tripsIncrementalDF = spark.
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
load(basePath);
-tripsIncrementalDF.registerTempTable("hudi_trips_incremental")
+tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_trips_incremental where fare > 20.0").show()
```
@@ -156,7 +166,7 @@ val tripsPointInTimeDF =
spark.read.format("org.apache.hudi").
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath);
-tripsPointInTimeDF.registerTempTable("hudi_trips_point_in_time")
+tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_trips_point_in_time where fare > 20.0").show()
```
@@ -196,8 +206,9 @@ Note: Only `Append` mode is supported for delete operation.
## Where to go from here?
You can also do the quickstart by [building hudi
yourself](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source),
-and using `--jars <path to
hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar`
in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating`
+and using `--jars <path to
hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar`
in the spark-shell command above
+instead of `--packages
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`. Hudi also supports
scala 2.12. Refer [build with scala
2.12](https://github.com/apache/incubator-hudi#build-with-scala-212)
+for more info.
Also, we used Spark here to show case the capabilities of Hudi. However, Hudi
can support multiple table types/query types and
Hudi tables can be queried from query engines like Hive, Spark, Presto and
much more. We have put together a
diff --git a/docs/_docs/1_2_structure.md b/docs/_docs/1_2_structure.md
index e080fcd..ddcdb1a 100644
--- a/docs/_docs/1_2_structure.md
+++ b/docs/_docs/1_2_structure.md
@@ -6,7 +6,7 @@ summary: "Hudi brings stream processing to big data, providing
fresh data while
last_modified_at: 2019-12-30T15:59:57-04:00
---
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical
tables over DFS
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
or cloud stores) and provides three types of querying.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical
tables over DFS
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
or cloud stores) and provides three types of queries.
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
diff --git a/docs/_docs/2_1_concepts.md b/docs/_docs/2_1_concepts.md
index c99aa41..cf61811 100644
--- a/docs/_docs/2_1_concepts.md
+++ b/docs/_docs/2_1_concepts.md
@@ -141,11 +141,11 @@ The intention of copy on write table, is to fundamentally
improve how tables are
Merge on read table is a superset of copy on write, in the sense it still
supports read optimized queries of the table by exposing only the base/columnar
files in latest file slices.
Additionally, it stores incoming upserts for each file group, onto a row based
delta log, to support snapshot queries by applying the delta log,
-onto the latest version of each file id on-the-fly during query time. Thus,
this table type attempts to balance read and write amplication intelligently,
to provide near real-time data.
+onto the latest version of each file id on-the-fly during query time. Thus,
this table type attempts to balance read and write amplification intelligently,
to provide near real-time data.
The most significant change here, would be to the compactor, which now
carefully chooses which delta log files need to be compacted onto
their columnar base file, to keep the query performance in check (larger delta
log files would incur longer merge times with merge data on query side)
-Following illustrates how the table works, and shows two types of querying -
snapshot querying and read optimized querying.
+Following illustrates how the table works, and shows two types of queries -
snapshot query and read optimized query.
<figure>
<img class="docimage" src="/assets/images/hudi_mor.png" alt="hudi_mor.png"
style="max-width: 100%" />
@@ -158,7 +158,7 @@ There are lot of interesting things happening in this
example, which bring out t
all the data from 10:05 to 10:10. The base columnar files are still versioned
with the commit, as before.
Thus, if one were to simply look at base files alone, then the table layout
looks exactly like a copy on write table.
- A periodic compaction process reconciles these changes from the delta log
and produces a new version of base file, just like what happened at 10:05 in
the example.
- - There are two ways of querying the same underlying table: Read Optimized
querying and Snapshot querying, depending on whether we chose query performance
or freshness of data.
+ - There are two ways of querying the same underlying table: Read Optimized
query and Snapshot query, depending on whether we chose query performance or
freshness of data.
- The semantics around when data from a commit is available to a query
changes in a subtle way for a read optimized query. Note, that such a query
running at 10:10, wont see data after 10:05 above, while a snapshot query
always sees the freshest data.
- When we trigger compaction & what it decides to compact hold all the key to
solving these hard problems. By implementing a compacting
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index b407111..52ba503 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -43,23 +43,56 @@ Command line options describe capabilities in more detail
```java
[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
Usage: <main class> [options]
- Options:
+Options:
+ --checkpoint
+ Resume Delta Streamer from this checkpoint.
--commit-on-errors
- Commit even when some records failed to be written
+ Commit even when some records failed to be written
+ Default: false
+ --compact-scheduling-minshare
+ Minshare for compaction as defined in
+ https://spark.apache.org/docs/latest/job-scheduling.html
+ Default: 0
+ --compact-scheduling-weight
+ Scheduling weight for compaction as defined in
+ https://spark.apache.org/docs/latest/job-scheduling.html
+ Default: 1
+ --continuous
+ Delta Streamer runs in continuous mode running source-fetch -> Transform
+ -> Hudi Write in loop
+ Default: false
+ --delta-sync-scheduling-minshare
+ Minshare for delta sync as defined in
+ https://spark.apache.org/docs/latest/job-scheduling.html
+ Default: 0
+ --delta-sync-scheduling-weight
+ Scheduling weight for delta sync as defined in
+ https://spark.apache.org/docs/latest/job-scheduling.html
+ Default: 1
+ --disable-compaction
+ Compaction is enabled for MoR table by default. This flag disables it
Default: false
--enable-hive-sync
- Enable syncing to hive
- Default: false
+ Enable syncing to hive
+ Default: false
--filter-dupes
- Should duplicate records from source be dropped/filtered outbefore
- insert/bulk-insert
+ Should duplicate records from source be dropped/filtered out before
+ insert/bulk-insert
Default: false
--help, -h
- --hudi-conf
- Any configuration that can be set in the properties file (using the
CLI
- parameter "--propsFilePath") can also be passed command line using
this
- parameter
- Default: []
+
+ --hoodie-conf
+ Any configuration that can be set in the properties file (using the CLI
+ parameter "--propsFilePath") can also be passed command line using this
+ parameter
+ Default: []
+ --max-pending-compactions
+ Maximum number of outstanding inflight/requested compactions. Delta Sync
+ will not happen unlessoutstanding compactions is less than this number
+ Default: 5
+ --min-sync-interval-seconds
+ the min sync interval of each sync in continuous mode
+ Default: 0
--op
Takes one of these values : UPSERT (default), INSERT (use when input is
purely new data/inserts to gain speed)
@@ -69,19 +102,22 @@ Usage: <main class> [options]
subclass of HoodieRecordPayload, that works off a GenericRecord.
Implement your own, if you want to do something other than overwriting
existing value
- Default: org.apache.hudi.OverwriteWithLatestAvroPayload
+ Default: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
--props
path to properties file on localfs or dfs, with configurations for
- Hudi client, schema provider, key generator and data source. For
- Hudi client props, sane defaults are used, but recommend use to
+ hoodie client, schema provider, key generator and data source. For
+ hoodie client props, sane defaults are used, but recommend use to
provide basic things like metrics endpoints, hive configs etc. For
sources, referto individual classes, for supported properties.
Default:
file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
--schemaprovider-class
subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
schemas to input & target table data, built in options:
- FilebasedSchemaProvider
- Default: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+ org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source (See
+ org.apache.hudi.utilities.sources.Source) implementation can implement
+ their own SchemaProvider. For Sources that return Dataset<Row>, the
+ schema is obtained implicitly. However, this CLI option allows
+ overriding the schemaprovider returned by Source.
--source-class
Subclass of org.apache.hudi.utilities.sources to read data. Built-in
options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
@@ -89,7 +125,7 @@ Usage: <main class> [options]
Default: org.apache.hudi.utilities.sources.JsonDFSSource
--source-limit
Maximum amount of data to read from source. Default: No limit For e.g:
- DFSSource => max bytes to read, KafkaSource => max events to read
+ DFS-Source => max bytes to read, Kafka-Source => max events to read
Default: 9223372036854775807
--source-ordering-field
Field within source record to decide how to break ties between records
@@ -99,17 +135,19 @@ Usage: <main class> [options]
--spark-master
spark master to use.
Default: local[2]
+ * --table-type
+ Type of table. COPY_ON_WRITE (or) MERGE_ON_READ
* --target-base-path
- base path for the target Hudi table. (Will be created if did not
- exist first time around. If exists, expected to be a Hudi table)
+ base path for the target hoodie table. (Will be created if did not exist
+ first time around. If exists, expected to be a hoodie table)
* --target-table
name of the target table in Hive
--transformer-class
- subclass of org.apache.hudi.utilities.transform.Transformer. UDF to
- transform raw source dataset to a target dataset (conforming to target
- schema) before writing. Default : Not set. E:g -
+ subclass of org.apache.hudi.utilities.transform.Transformer. Allows
+ transforming raw source Dataset to a target Dataset (conforming to
+ target schema) before writing. Default : Not set. E:g -
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
- allows a SQL query template to be passed as a transformation function)
+ allows a SQL query templated to be passed as a transformation function)
```
The tool takes a hierarchically composed property file and has pluggable
interfaces for extracting data, key generation and providing schema. Sample
configs for ingesting from kafka and dfs are
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 8c6d357..2d97e2b 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -15,12 +15,12 @@ Specifically, following Hive tables are registered based
off [table name](/docs/
and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during
write.
If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
- - `hudi_trips` supports snapshot querying and incremental querying of the
table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+ - `hudi_trips` supports snapshot query and incremental query on the table
backed by `HoodieParquetInputFormat`, exposing purely columnar data.
If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
- - `hudi_trips_rt` supports snapshot querying and incremental querying
(providing near-real time data) of the table backed by
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
- - `hudi_trips_ro` supports read optimized querying of the table backed by
`HoodieParquetInputFormat`, exposing purely columnar data.
+ - `hudi_trips_rt` supports snapshot query and incremental query (providing
near-real time data) on the table backed by
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized query on the table backed by
`HoodieParquetInputFormat`, exposing purely columnar data.
As discussed in the concepts section, the one key primitive needed for
[incrementally
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
@@ -89,11 +89,11 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
Spark provides much easier deployment & management of Hudi jars and bundles
into jobs/notebooks. At a high level, there are two ways to access Hudi tables
in Spark.
- **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the
snapshot querying, relying on the custom Hudi input formats again like Hive.
+ - **Read as Hive tables** : Supports all three query types, including the
snapshot queries, relying on the custom Hudi input formats again like Hive.
- In general, your spark job needs a dependency to `hudi-spark` or
`hudi-spark-bundle-x.y.z.jar` needs to be on the class path of driver &
executors (hint: use `--jars` argument)
+ In general, your spark job needs a dependency to `hudi-spark` or
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver &
executors (hint: use `--jars` argument)
-### Read optimized querying
+### Read optimized query
Pushing a path filter into sparkContext as follows allows for read optimized
querying of a Hudi hive table using SparkSQL.
This method retains Spark built-in optimizations for reading Parquet files
like vectorized reading on Hudi tables.
@@ -110,12 +110,12 @@ Dataset<Row> hoodieROViewDF =
spark.read().format("org.apache.hudi")
.load("/glob/path/pattern");
```
-### Snapshot querying {#spark-snapshot-querying}
-Currently, near-real time data can only be queried as a Hive table in Spark
using snapshot querying mode. In order to do this, set
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback
+### Snapshot query {#spark-snapshot-query}
+Currently, near-real time data can only be queried as a Hive table in Spark
using snapshot query mode. In order to do this, set
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback
to using the Hive Serde to read the data (planning/executions is still Spark).
```java
-$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path
/etc/hive/conf --packages org.apache.spark:spark-avro_2.11:2.4.4 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
+$ spark-shell --jars hudi-spark-bundle_2.11-x.y.z-SNAPSHOT.jar
--driver-class-path /etc/hive/conf --packages
org.apache.spark:spark-avro_2.11:2.4.4 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
scala> sqlContext.sql("select count(*) from hudi_trips_rt where datestr =
'2016-10-02'").show()
scala> sqlContext.sql("select count(*) from hudi_trips_rt where datestr =
'2016-10-02'").show()
@@ -148,5 +148,5 @@ Additionally, `HoodieReadClient` offers the following
functionality using Hudi's
## Presto
-Presto is a popular query engine, providing interactive query performance.
Presto currently supports only read optimized querying on Hudi tables.
+Presto is a popular query engine, providing interactive query performance.
Presto currently supports only read optimized queries on Hudi tables.
This requires the `hudi-presto-bundle` jar to be placed into
`<presto_install>/plugin/hive-hadoop2/`, across the installation.
diff --git a/docs/_docs/2_5_performance.md b/docs/_docs/2_5_performance.md
index 6f489fc..580d180 100644
--- a/docs/_docs/2_5_performance.md
+++ b/docs/_docs/2_5_performance.md
@@ -41,9 +41,9 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95%
inserts) on a event
**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a
challenging workload like an '100% update' database ingestion workload spanning
3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
a **80-100% speedup**.
-## Read Optimized Queries
+## Snapshot Queries
-The major design goal for read optimized querying is to achieve the latency
reduction & efficiency gains in previous section,
+The major design goal for snapshot queries is to achieve the latency reduction
& efficiency gains in previous section,
with no impact on queries. Following charts compare the Hudi vs non-Hudi
tables across Hive/Presto/Spark queries and demonstrate this.
**Hive**