bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369171577
########## File path: docs/_docs/2_3_querying_data.md ##########
@@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](/docs/concepts.html#views).
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types).
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write.
-For e.g, if `table name = hudi_tbl`, then we get
+Specifically, the following Hive tables are registered based on the [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
+ - `hudi_trips` supports snapshot querying and incremental querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying (providing near-real time data) of the table backed by `HoodieParquetRealtimeInputFormat`, exposing a merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
 As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi tables can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows
 since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above,
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset.
+joined with other tables (dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. The incremental view is realized by querying one of the tables above,
+with special configurations that indicate to query planning that only incremental data needs to be fetched out of the table.
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+In the sections below, we will discuss how to access these query types from different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
+In order for Hive to recognize Hudi tables and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
 in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format
 classes with its dependencies are available for query planning & execution.
 
-### Read Optimized table
+### Read optimized querying
 In addition to the setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified path name of the
 inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set
 to `org.apache.hadoop.hive.ql.io.HiveInputFormat`.
 
-### Real time table
+### Snapshot querying
 In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
 queries can pick up the custom RecordReader as well.
 
-### Incremental Pulling
-
+### Incremental pulling

Review comment:
   done!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
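For reference, the beeline settings described in the diff can be sketched as a Hive session (a minimal sketch: the `hudi_trips_ro`/`hudi_trips_rt` names follow the `hudi_trips` example in the diff, and the queries themselves are illustrative — actual table schemas depend on the deployment):

```sql
-- Read optimized querying: point Hive at Hudi's input format first.
SET hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
-- When running on the Tez execution engine, additionally:
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT count(*) FROM hudi_trips_ro;

-- Snapshot querying on a MERGE_ON_READ table: the _rt table serves a merged
-- view of base and log data via HoodieParquetRealtimeInputFormat, provided
-- the hudi bundle jar is installed across the cluster as noted above.
SELECT count(*) FROM hudi_trips_rt;
```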