bhasudha commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r369171577
########## File path: docs/_docs/2_3_querying_data.md ##########
@@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](/docs/concepts.html#views).
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types).
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write.
-For e.g, if `table name = hudi_tbl`, then we get
+Specifically, the following Hive tables are registered based on the [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
+ - `hudi_trips` supports snapshot querying and incremental querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying (providing near-real time data) of the table backed by `HoodieParquetRealtimeInputFormat`, exposing a merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
 As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi tables can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows
 since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above,
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset.
+joined with other tables (dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. The incremental view is realized by querying one of the tables above,
+with special configurations that indicate to query planning that only incremental data needs to be fetched out of the table.
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+In the sections below, we will discuss how to access these query types from different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
+In order for Hive to recognize Hudi tables and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
 in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format
 classes with its dependencies are available for query planning & execution.
 
-### Read Optimized table
+### Read optimized querying
 In addition to the setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified path name of the
 inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set
 to `org.apache.hadoop.hive.ql.io.HiveInputFormat`.
 
-### Real time table
+### Snapshot querying
 In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
 queries can pick up the custom RecordReader as well.
 
-### Incremental Pulling
-
+### Incremental pulling

Review comment:
   done!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
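For reference, the beeline settings described in the diff can be sketched as a Hive session (a minimal sketch: the `hudi_trips_ro`/`hudi_trips_rt` names follow the `hudi_trips` example in the diff, and the queries themselves are illustrative — actual table schemas depend on the deployment):

```sql
-- Read optimized querying: point Hive at Hudi's input format first.
SET hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
-- When running on the Tez execution engine, additionally:
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT count(*) FROM hudi_trips_ro;

-- Snapshot querying on a MERGE_ON_READ table: the _rt table serves a merged
-- view of base and log data via HoodieParquetRealtimeInputFormat, provided
-- the hudi bundle jar is installed across the cluster as noted above.
SELECT count(*) FROM hudi_trips_rt;
```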