This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7e82fe2 [HUDI-589] Follow on fixes to querying_data page
7e82fe2 is described below
commit 7e82fe2f1f1137ae946039b80ad3246abc3af7a3
Author: Vinoth Chandar <[email protected]>
AuthorDate: Mon Mar 2 13:43:39 2020 -0800
[HUDI-589] Follow on fixes to querying_data page
---
docs/_docs/2_3_querying_data.md | 90 ++++++++++++++++++++---------------------
1 file changed, 45 insertions(+), 45 deletions(-)
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 1a2ae08..c4ab865 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,10 +9,10 @@ last_modified_at: 2019-12-30T15:59:57-04:00
Conceptually, Hudi stores data physically once on DFS, while providing 3
different ways of querying, as explained
[before](/docs/concepts.html#query-types).
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines
like Hive, Spark SQL, Spark datasource and Presto.
+bundle has been installed, the table can be queried by popular query engines
like Hive, Spark SQL, Spark Datasource API and Presto.
Specifically, following Hive tables are registered based off [table
name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
-and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during
write.
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) configs passed
during write.
If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
- `hudi_trips` supports snapshot query and incremental query on the table
backed by `HoodieParquetInputFormat`, exposing purely columnar data.
@@ -20,37 +20,39 @@ If `table name = hudi_trips` and `table type =
COPY_ON_WRITE`, then we get:
If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
- `hudi_trips_rt` supports snapshot query and incremental query (providing
near-real time data) on the table backed by
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
- - `hudi_trips_ro` supports read optimized query on the table backed by
`HoodieParquetInputFormat`, exposing purely columnar data.
-
+ - `hudi_trips_ro` supports read optimized query on the table backed by
`HoodieParquetInputFormat`, exposing purely columnar data stored in base files.
-As discussed in the concepts section, the one key primitive needed for
[incrementally
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
+As discussed in the concepts section, the one key capability needed for
[incrementally
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
is obtaining a change stream/log from a table. Hudi tables can be queried
incrementally, which means you can get ALL and ONLY the updated & new rows
since a specified instant time. This, together with upserts, is particularly
useful for building data pipelines where 1 or more source Hudi tables are
incrementally queried (streams/facts),
joined with other tables (tables/dimensions), to [write out
deltas](/docs/writing_data.html) to a target Hudi table. Incremental queries
are realized by querying one of the tables above,
with special configurations that indicates to query planning that only
incremental data needs to be fetched out of the table.
-## SUPPORT MATRIX
+## Support Matrix
+
+Following tables show whether a given query is supported on specific query
engine.
-### COPY_ON_WRITE tables
+### Copy-On-Write tables
-||Snapshot|Incremental|Read Optimized|
-||--------|-----------|--------------|
-|**Hive**|Y|Y|N/A|
-|**Spark SQL**|Y|Y|N/A|
-|**Spark datasource**|Y|Y|N/A|
-|**Presto**|Y|N|N/A|
+|Query Engine|Snapshot Queries|Incremental Queries|
+|------------|--------|-----------|
+|**Hive**|Y|Y|
+|**Spark SQL**|Y|Y|
+|**Spark Datasource**|Y|Y|
+|**Presto**|Y|N|
+
+Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
-### MERGE_ON_READ tables
+### Merge-On-Read tables
-||Snapshot|Incremental|Read Optimized|
-||--------|-----------|--------------|
+|Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
+|------------|--------|-----------|--------------|
|**Hive**|Y|Y|Y|
|**Spark SQL**|Y|Y|Y|
-|**Spark datasource**|N|N|Y|
+|**Spark Datasource**|N|N|Y|
|**Presto**|N|N|Y|
-
In sections, below we will discuss specific setup to access different query
types from different query engines.
## Hive
@@ -103,13 +105,11 @@ would ensure Map Reduce execution is chosen for a Hive
query, which combines par
separated) and calls InputFormat.listStatus() only once with all those
partitions.
## Spark SQL
-Supports all query types across both Hudi table types, relying on the custom
Hudi input formats again like Hive.
-Typically notebook users and spark-shell users leverage spark sql for querying
Hudi tables. Please add hudi-spark-bundle
-as described above via --jars or --packages.
+Once the Hudi tables have been registered to the Hive metastore, it can be
queried using the Spark-Hive integration. It supports all query types across
both Hudi table types,
+relying on the custom Hudi input formats again like Hive. Typically notebook
users and spark-shell users leverage spark sql for querying Hudi tables. Please
add hudi-spark-bundle as described above via --jars or --packages.
-### Snapshot query {#spark-snapshot-query}
-By default, Spark SQL will try to use its own parquet reader instead of Hive
SerDe when reading from Hive metastore parquet tables.
-However, for MERGE_ON_READ tables which has both parquet and avro data, this
default setting needs to be turned off using set
`spark.sql.hive.convertMetastoreParquet=false`.
+By default, Spark SQL will try to use its own parquet reader instead of Hive
SerDe when reading from Hive metastore parquet tables. However, for
MERGE_ON_READ tables which has
+both parquet and avro data, this default setting needs to be turned off using
set `spark.sql.hive.convertMetastoreParquet=false`.
This will force Spark to fallback to using the Hive Serde to read the data
(planning/executions is still Spark).
```java
@@ -126,23 +126,30 @@ If using spark's built in support, additionally a path
filter needs to be pushed
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
```
-### Incremental querying {#spark-incr-query}
-Incremental queries work like hive incremental queries. The `hudi-spark`
module offers the DataSource API, a more elegant way to query data from Hudi
table and process it via Spark.
-A sample incremental query, that will obtain all records written since
`beginInstantTime`, looks like below.
+## Spark Datasource
+
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines.
Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how
standard
+datasources work (e.g: `spark.read.parquet`). Both snapshot querying and
incremental querying are supported here. Typically spark jobs require adding
`--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to classpath
of drivers
+and executors. Alternatively, hudi-spark-bundle can also fetched via the
`--packages` options (e.g: `--packages
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`).
+
+
+### Incremental query {#spark-incr-query}
+Of special interest to spark pipelines, is Hudi's ability to support
incremental queries, like below. A sample incremental query, that will obtain
all records written since `beginInstantTime`, looks like below.
+Thanks to Hudi's support for record level change streams, these incremental
pipelines often offer 10x efficiency over batch counterparts, by only
processing the changed records.
+The following snippet shows how to obtain all records changed after
`beginInstantTime` and run some SQL on them.
```java
Dataset<Row> hudiIncQueryDF = spark.read()
.format("org.apache.hudi")
- .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
- DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
- .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
- <beginInstantTime>)
+ .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+ .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
<beginInstantTime>)
.load(tablePath); // For incremental query, pass in the root/base path of
table
hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_trips_incremental where fare > 20.0").show()
```
+For examples, refer to [Setup spark-shell in
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
Please refer to [configurations](/docs/configurations.html#spark-datasource)
section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using
Hudi's implicit indexing.
@@ -150,27 +157,20 @@ Additionally, `HoodieReadClient` offers the following
functionality using Hudi's
| **API** | **Description** |
|-------|--------|
| read(keys) | Read out the data corresponding to the keys as a DataFrame,
using Hudi's own index for faster lookup |
-| filterExists() | Filter out already existing records from the provided
RDD[HoodieRecord]. Useful for de-duplication |
+| filterExists() | Filter out already existing records from the provided
`RDD[HoodieRecord]`. Useful for de-duplication |
| checkExists(keys) | Check if the provided keys exist in a Hudi table |
-## Spark datasource
-
-Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how
standard datasources work (e.g: `spark.read.parquet`).
-Both snapshot querying and incremental querying are supported here. Typically
spark jobs require adding `--jars <path to
jar>/hudi-spark-bundle_2.11:0.5.1-incubating`
-to classpath of drivers and executors. When using spark shell instead of
`--jars`, `--packages` can also be used to fetch the hudi-spark-bundle like
this: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
-For examples, refer to [Setup spark-shell in
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
-
## Presto
-Presto is a popular query engine, providing interactive query performance.
Presto currently supports snapshot queries on
-COPY_ON_WRITE and read optimized queries on MERGE_ON_READ Hudi tables. This
requires the `hudi-presto-bundle` jar
-to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the
installation.
+Presto is a popular query engine, providing interactive query performance.
Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized
queries
+on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar to be
placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+
+## Impala (Not Officially Released)
-## Impala(Not Officially Released)
+### Snapshot Query
-### Read optimized table
+Impala is able to query Hudi Copy-on-write table as an [EXTERNAL
TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables)
on HDFS.
-Impala is able to query Hudi read optimized table as an [EXTERNAL
TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables)
on HDFS.
To create a Hudi read optimized table on Impala:
```
CREATE EXTERNAL TABLE database.table_name