[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784217

## File path: docs/_docs/2_2_writing_data.md ##

@@ -156,41 +157,31 @@ inputDF.write()
 ## Syncing to Hive
 
-Both tools above support syncing of the dataset's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
+Both tools above support syncing of the table's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
 In case, its preferable to run this from commandline or in an independent jvm, Hudi provides a `HiveSyncTool`, which can be invoked as below,
-once you have built the hudi-hive module.
+once you have built the hudi-hive module. Following is how we sync the above Datasource Writer written table to Hive metastore.
+
+```java
+cd hudi-hive
+./run_sync_tool.sh --jdbc-url jdbc:hive2:\/\/hiveserver:1 --user hive --pass hive --partitioned-by partition --base-path --database default --table
+```
+
+Starting with Hudi 0.5.1 version read optimized version of merge-on-read tables are suffixed '_ro' by default. For backwards compatibility with older Hudi versions,
+an optional HiveSyncConfig - `--skip-ro-suffix`, has been provided to turn off '_ro' suffixing if desired. Explore other hive sync options using the following command:
 
 ```java
 cd hudi-hive
 ./run_sync_tool.sh
 [hudi-hive]$ ./run_sync_tool.sh --help
-Usage: [options]
- Options:
- * --base-path
- Basepath of Hudi dataset to sync
- * --database
- name of the target database in Hive
---help, -h
- Default: false
- * --jdbc-url
- Hive jdbc connect url
- * --use-jdbc
- Whether to use jdbc connection or hive metastore (via thrift)
- * --pass
- Hive password
- * --table
- name of the target table in Hive
- * --user
- Hive username
 ```
 
 ## Deletes
 
-Hudi supports implementing two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation.
+Hudi supports implementing two types of deletes on data stored in Hudi tables, by enabling the user to specify a different record payload implementation.

Review comment:
Let's link to the delete blog from here?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services

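The naming rule in the diff above ('_ro' suffix by default, `--skip-ro-suffix` to turn it off), combined with the table names listed in the `2_3_querying_data.md` diff later in this thread, can be sketched roughly as follows. This is an illustration only, not Hudi code: the function `hive_table_names` and its parameters are hypothetical, and it assumes that with the suffix turned off the read optimized table is registered under the bare table name.

```python
from typing import List


def hive_table_names(table_name: str, table_type: str,
                     skip_ro_suffix: bool = False) -> List[str]:
    """Hypothetical helper: Hive tables one would expect after a sync.

    Mirrors the prose in the docs diff; it is not part of Hudi.
    """
    if table_type == "COPY_ON_WRITE":
        # A single table serving snapshot and incremental queries.
        return [table_name]
    if table_type == "MERGE_ON_READ":
        # Assumption: skipping the suffix leaves the bare table name.
        ro_name = table_name if skip_ro_suffix else f"{table_name}_ro"
        return [ro_name, f"{table_name}_rt"]
    raise ValueError(f"unknown table type: {table_type}")


cow_tables = hive_table_names("hudi_trips", "COPY_ON_WRITE")
mor_tables = hive_table_names("hudi_trips", "MERGE_ON_READ")
mor_legacy = hive_table_names("hudi_trips", "MERGE_ON_READ", skip_ro_suffix=True)
print(cow_tables, mor_tables, mor_legacy)
```

The `skip_ro_suffix=True` case corresponds to the backwards-compatible naming that older Hudi versions used for the read optimized table.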
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783925

## File path: docs/_docs/2_1_concepts.md ##

@@ -1,37 +1,37 @@
 ---
 title: "Concepts"
-keywords: hudi, design, storage, views, timeline
+keywords: hudi, design, table, queries, timeline
 permalink: /docs/concepts.html
 summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over hadoop compatible storages
 
- * Upsert (how do I change the dataset?)
- * Incremental pull (how do I fetch data that changed?)
+ * Update/Delete Records (how do I change records in a table?)
+ * Change Streams (how do I fetch data that changed?)

Review comment:
how do I fetch `records` that changed?

vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783760

## File path: docs/_docs/1_3_use_cases.md ##

@@ -20,7 +20,7 @@ or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows. It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.

Review comment:
ah. good catch

vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784404

## File path: docs/_docs/2_3_querying_data.md ##

@@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](/docs/concepts.html#views).
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types).
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write.
-For e.g, if `table name = hudi_tbl`, then we get
+Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
+ - `hudi_trips` supports snapshot querying and incremental querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying (providing near-real time data) of the table backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
 
 As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi tables can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows
 since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above,
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset.
+joined with other tables (tables/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is realized by querying one of the tables above,
+with special configurations that indicates to query planning that only incremental data needs to be fetched out of the table.
 
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+In sections, below we will discuss how to access these query types from different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
+In order for Hive to recognize Hudi tables and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
 in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format classes with its dependencies are available for query planning & execution.
 
-### Read Optimized table
+### Read optimized querying
 
 In addition to setup above, for beeline cli access, the

vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783683

## File path: docs/_docs/1_2_structure.md ##

@@ -6,16 +6,16 @@ summary: "Hudi brings stream processing to big data, providing fresh data while
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical datasets over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tables over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three types of querying.
 
- * **Read Optimized View** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
- * **Incremental View** - Provides a change stream out of the dataset to feed downstream jobs/ETLs.
- * **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))
+ * **Read Optimized querying** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.

Review comment:
just `Query` and not `querying`?

vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784018

## File path: docs/_docs/2_1_concepts.md ##

@@ -53,69 +53,70 @@ With the help of the timeline, an incremental query attempting to get all new da
 only the changed files without say scanning all the time buckets > 07:00.
 
 ## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on DFS. Dataset is broken up into partitions, which are folders containing data files for that partition,
+Hudi organizes a table into a directory structure under a `basepath` on DFS. Table is broken up into partitions, which are folders containing data files for that partition,
 very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
 
 Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
+`file slices`, where each slice contains a base file (`*.parquet`) produced at a certain commit/compaction instant time,
 along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of unused/older file slices to reclaim space on DFS.
 
-Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism.
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file id, via an indexing mechanism.
 This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
 
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written).
-In turn, `views` define how the underlying data is exposed to the queries (i.e how data is read).
+## Table Types & Querying

Review comment:
and Queries (instead of Querying)?
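
For readers following the file management and index wording in the diff above, here is a toy model of the layout it describes: file groups holding file slices, and an index that pins a hoodie key to a file id once the first version of a record is written. It is a sketch under stated assumptions, not Hudi's implementation. The class names `FileSlice`, `FileGroup` and `ToyIndex`, the file names, and the short instant strings are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileSlice:
    instant_time: str                 # commit/compaction instant of the base file
    base_file: str                    # stands in for a *.parquet file
    log_files: List[str] = field(default_factory=list)  # stands in for *.log.* files


@dataclass
class FileGroup:
    file_id: str
    slices: List[FileSlice] = field(default_factory=list)

    def latest_slice(self) -> FileSlice:
        # Instants here are equal-length strings, so lexical max is the newest.
        return max(self.slices, key=lambda s: s.instant_time)


class ToyIndex:
    """Hypothetical index: hoodie key (record key, partition path) -> file id."""

    def __init__(self) -> None:
        self._mapping: Dict[Tuple[str, str], str] = {}

    def tag(self, record_key: str, partition_path: str, candidate_file_id: str) -> str:
        # The first write decides the file id; later writes reuse that mapping.
        return self._mapping.setdefault((record_key, partition_path), candidate_file_id)


index = ToyIndex()
first = index.tag("trip-1", "2020/01/20", "fg-001")
again = index.tag("trip-1", "2020/01/20", "fg-002")  # still mapped to fg-001

group = FileGroup("fg-001", [
    FileSlice("0700", "base-0700.parquet", ["updates-0700.log.1"]),
    FileSlice("0800", "base-0800.parquet"),
])
newest = group.latest_slice()
print(first, again, newest.base_file)
```

In this model, compaction would append a new `FileSlice` to a group and cleaning would drop older ones, which is the MVCC behaviour the quoted paragraph describes.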