bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369171577
 
 

 ##########
 File path: docs/_docs/2_3_querying_data.md
 ##########
 @@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](/docs/concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types). 
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.   
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
+ - `hudi_trips` supports snapshot querying and incremental querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying (providing near-real time data) of the table  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+ 
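For illustration, a minimal HiveQL sketch of querying these registered tables, assuming the example table names above (the queries themselves are placeholders, not part of the documentation change):

```sql
-- COPY_ON_WRITE: snapshot and incremental queries go against the single registered table.
SELECT count(*) FROM hudi_trips;

-- MERGE_ON_READ: the _rt table merges base and log files, returning near-real time data...
SELECT count(*) FROM hudi_trips_rt;

-- ...while the _ro table reads only the compacted columnar base files.
SELECT count(*) FROM hudi_trips_ro;
```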
 
 As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi tables can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
 since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset. 
+joined with other tables (tables/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is realized by querying one of the tables above, 
+with special configurations that indicates to query planning that only incremental data needs to be fetched out of the table. 
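As a rough sketch of what those special configurations look like when pulling incrementally through Hive (the `hoodie.<table>.consume.*` session properties and the `_hoodie_commit_time` metadata column are assumptions based on Hudi's Hive integration; the instant time and data columns are placeholders):

```sql
-- Put the hudi_trips table into incremental consumption mode for this session.
SET hoodie.hudi_trips.consume.mode=INCREMENTAL;
-- Pull at most 3 commits worth of changes after the given start instant.
SET hoodie.hudi_trips.consume.max.commits=3;
SET hoodie.hudi_trips.consume.start.timestamp=20200101000000;

-- Only rows written after the start instant are returned; rider/fare are placeholder columns.
SELECT `_hoodie_commit_time`, rider, fare
FROM hudi_trips
WHERE `_hoodie_commit_time` > '20200101000000';
```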
 
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+In sections, below we will discuss how to access these query types from different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
+In order for Hive to recognize Hudi tables and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
 in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
 classes with its dependencies are available for query planning & execution. 
 
-### Read Optimized table
+### Read optimized querying
 In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the  fully qualified path name of the 
 inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
 to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
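For example, a beeline session for read optimized querying could look roughly like this (a sketch built only from the settings named above; the query and table name are illustrative):

```sql
-- Make Hive use Hudi's input format when planning the query.
SET hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
-- Additionally required when the execution engine is Tez.
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Read optimized query against the table registered earlier.
SELECT count(*) FROM hudi_trips_ro;
```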
 
-### Real time table
+### Snapshot querying
 In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
 queries can pick up the custom RecordReader as well.
 
-### Incremental Pulling
-
+### Incremental pulling
 
 Review comment:
   done!
