[incubator-hudi] branch asf-site updated: [HUDI-589] Follow on fixes to querying_data page

bhavanisudha Mon, 02 Mar 2020 14:05:06 -0800

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 7e82fe2  [HUDI-589] Follow on fixes to querying_data page
7e82fe2 is described below

commit 7e82fe2f1f1137ae946039b80ad3246abc3af7a3
Author: Vinoth Chandar <[email protected]>
AuthorDate: Mon Mar 2 13:43:39 2020 -0800

    [HUDI-589] Follow on fixes to querying_data page
---
 docs/_docs/2_3_querying_data.md | 90 ++++++++++++++++++++---------------------
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 1a2ae08..c4ab865 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,10 +9,10 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark SQL, Spark datasource and Presto.
+bundle has been installed, the table can be queried by popular query engines 
like Hive, Spark SQL, Spark Datasource API and Presto.
 
 Specifically, following Hive tables are registered based off [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
-and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during 
write.   
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) configs passed 
during write.   
 
 If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
  - `hudi_trips` supports snapshot query and incremental query on the table 
backed by `HoodieParquetInputFormat`, exposing purely columnar data.
@@ -20,37 +20,39 @@ If `table name = hudi_trips` and `table type = 
COPY_ON_WRITE`, then we get:
 
 If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
  - `hudi_trips_rt` supports snapshot query and incremental query (providing 
near-real time data) on the table  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
- - `hudi_trips_ro` supports read optimized query on the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
- 
+ - `hudi_trips_ro` supports read optimized query on the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data stored in base files.
 
-As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
+As discussed in the concepts section, the one key capability needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
 is obtaining a change stream/log from a table. Hudi tables can be queried 
incrementally, which means you can get ALL and ONLY the updated & new rows 
 since a specified instant time. This, together with upserts, is particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally queried (streams/facts),
 joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental queries 
are realized by querying one of the tables above, 
 with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the table. 
 
 
-## SUPPORT MATRIX
+## Support Matrix
+
+Following tables show whether a given query is supported on specific query 
engine.
 
-### COPY_ON_WRITE tables
+### Copy-On-Write tables
   
-||Snapshot|Incremental|Read Optimized|
-||--------|-----------|--------------|
-|**Hive**|Y|Y|N/A|
-|**Spark SQL**|Y|Y|N/A|
-|**Spark datasource**|Y|Y|N/A|
-|**Presto**|Y|N|N/A|
+|Query Engine|Snapshot Queries|Incremental Queries|
+|------------|--------|-----------|
+|**Hive**|Y|Y|
+|**Spark SQL**|Y|Y|
+|**Spark Datasource**|Y|Y|
+|**Presto**|Y|N|
+
+Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 
-### MERGE_ON_READ tables
+### Merge-On-Read tables
 
-||Snapshot|Incremental|Read Optimized|
-||--------|-----------|--------------|
+|Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
+|------------|--------|-----------|--------------|
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
-|**Spark datasource**|N|N|Y|
+|**Spark Datasource**|N|N|Y|
 |**Presto**|N|N|Y|
 
-
 In sections, below we will discuss specific setup to access different query 
types from different query engines. 
 
 ## Hive
@@ -103,13 +105,11 @@ would ensure Map Reduce execution is chosen for a Hive 
query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
 ## Spark SQL
-Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
-Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
-as described above via --jars or --packages.
+Once the Hudi tables have been registered to the Hive metastore, it can be 
queried using the Spark-Hive integration. It supports all query types across 
both Hudi table types, 
+relying on the custom Hudi input formats again like Hive. Typically notebook 
users and spark-shell users leverage spark sql for querying Hudi tables. Please 
add hudi-spark-bundle as described above via --jars or --packages.
  
-### Snapshot query {#spark-snapshot-query}
-By default, Spark SQL will try to use its own parquet reader instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
-However, for MERGE_ON_READ tables which has both parquet and avro data, this 
default setting needs to be turned off using set 
`spark.sql.hive.convertMetastoreParquet=false`. 
+By default, Spark SQL will try to use its own parquet reader instead of Hive 
SerDe when reading from Hive metastore parquet tables. However, for 
MERGE_ON_READ tables which has 
+both parquet and avro data, this default setting needs to be turned off using 
set `spark.sql.hive.convertMetastoreParquet=false`. 
 This will force Spark to fallback to using the Hive Serde to read the data 
(planning/executions is still Spark). 
 
 ```java
@@ -126,23 +126,30 @@ If using spark's built in support, additionally a path 
filter needs to be pushed
 
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
 ```
 
-### Incremental querying {#spark-incr-query}
-Incremental queries work like hive incremental queries. The `hudi-spark` 
module offers the DataSource API, a more elegant way to query data from Hudi 
table and process it via Spark.
-A sample incremental query, that will obtain all records written since 
`beginInstantTime`, looks like below.
+## Spark Datasource
+
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines. 
Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard 
+datasources work (e.g: `spark.read.parquet`). Both snapshot querying and 
incremental querying are supported here. Typically spark jobs require adding 
`--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to classpath 
of drivers 
+and executors. Alternatively, hudi-spark-bundle can also fetched via the 
`--packages` options (e.g: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`).
+
+
+### Incremental query {#spark-incr-query}
+Of special interest to spark pipelines, is Hudi's ability to support 
incremental queries, like below. A sample incremental query, that will obtain 
all records written since `beginInstantTime`, looks like below.
+Thanks to Hudi's support for record level change streams, these incremental 
pipelines often offer 10x efficiency over batch counterparts, by only 
processing the changed records.
+The following snippet shows how to obtain all records changed after 
`beginInstantTime` and run some SQL on them.
 
 ```java
  Dataset<Row> hudiIncQueryDF = spark.read()
      .format("org.apache.hudi")
-     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
-             DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
-     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
-            <beginInstantTime>)
+     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), 
<beginInstantTime>)
      .load(tablePath); // For incremental query, pass in the root/base path of 
table
      
 hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
 spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  
hudi_trips_incremental where fare > 20.0").show()
 ```
 
+For examples, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell). 
 Please refer to [configurations](/docs/configurations.html#spark-datasource) 
section, to view all datasource options.
 
 Additionally, `HoodieReadClient` offers the following functionality using 
Hudi's implicit indexing.
@@ -150,27 +157,20 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | **API** | **Description** |
 |-------|--------|
 | read(keys) | Read out the data corresponding to the keys as a DataFrame, 
using Hudi's own index for faster lookup |
-| filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
+| filterExists() | Filter out already existing records from the provided 
`RDD[HoodieRecord]`. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
-## Spark datasource
-
-Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
-Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars <path to 
jar>/hudi-spark-bundle_2.11:0.5.1-incubating`
-to classpath of drivers and executors. When using spark shell instead of 
`--jars`, `--packages` can also be used to fetch the hudi-spark-bundle like 
this: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
-For examples, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
-
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. 
Presto currently supports snapshot queries on
-COPY_ON_WRITE and read optimized queries on MERGE_ON_READ Hudi tables. This 
requires the `hudi-presto-bundle` jar
-to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the 
installation.
+Presto is a popular query engine, providing interactive query performance. 
Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized 
queries 
+on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar to be 
placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+
+## Impala (Not Officially Released)
 
-## Impala(Not Officially Released)
+### Snapshot Query
 
-### Read optimized table
+Impala is able to query Hudi Copy-on-write table as an [EXTERNAL 
TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables)
 on HDFS.  
 
-Impala is able to query Hudi read optimized table as an [EXTERNAL 
TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables)
 on HDFS.  
 To create a Hudi read optimized table on Impala:
 ```
 CREATE EXTERNAL TABLE database.table_name

[incubator-hudi] branch asf-site updated: [HUDI-589] Follow on fixes to querying_data page

Reply via email to