HAWQ-1289 - add hive profile example with partitioning and mixed data types (closes #90)
Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/d0745d7e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/d0745d7e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/d0745d7e

Branch: refs/heads/release/2.1.0.0-incubating
Commit: d0745d7e7924287a7ef57a30b83adcc60f1a6655
Parents: 7a7a87a
Author: Lisa Owen <[email protected]>
Authored: Tue Feb 21 16:39:19 2017 -0800
Committer: David Yozie <[email protected]>
Committed: Tue Feb 21 16:39:19 2017 -0800

----------------------------------------------------------------------
 markdown/pxf/HivePXF.html.md.erb | 278 ++++++++++++++++++++++++++--------
 1 file changed, 216 insertions(+), 62 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/d0745d7e/markdown/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb
index 51b317c..6101016 100644
--- a/markdown/pxf/HivePXF.html.md.erb
+++ b/markdown/pxf/HivePXF.html.md.erb
@@ -155,6 +155,8 @@ Create a Hive table to expose our sample data set.
     hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt' INTO TABLE sales_info;
     ```
 
+
+    In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
 
 3. Perform a query on `sales_info` to verify that the data was loaded successfully:
 
@@ -162,7 +164,21 @@ Create a Hive table to expose our sample data set.
     hive> SELECT * FROM sales_info;
     ```
 
-In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
+### <a id="get_hdfs_file_location"></a>Determine the HDFS location of a Hive Table
+
+Should you need to identify the HDFS file location of a Hive managed table, reference it using its HDFS file path. You can determine a Hive table's location in HDFS using the `DESCRIBE` command, for example:
+
+``` sql
+hive> DESCRIBE EXTENDED sales_info;
+Detailed Table Information
+...
+location:hdfs://<namenode>:<port>/apps/hive/warehouse/sales_info
+...
+```
+
+The `location` value identifies the HDFS file path of the table.
+
+
 ## <a id="hcatalog"></a>Using PXF and HCatalog to Query Hive
 
@@ -240,7 +256,7 @@ To obtain a description of a Hive table with HCatalog integration, you can use t
     ```
 
    ``` sql
-    postgres=# \d+ hcatalog.default.sales_info_rcfile;
+    postgres=# \d+ hcatalog.default.sales_info;
    ```
 
    ``` shell
@@ -327,7 +343,7 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 
 ## <a id="profile_hive"></a>Hive Profile
 
-Use the `Hive` profile with any Hive file storage format. With the `Hive` profile, you can also access heterogenous format data in a single table where each partition may be stored in a different file format. In both cases, the `Hive` profile will use the optimal `Hive*` profile for the underlying file format type.
+Use the `Hive` profile with any Hive file storage format. With the `Hive` profile, you can also access heterogeneous format data in a single table where each partition may be stored in a different file format. In both cases, the `Hive` profile will use the optimal `Hive*` profile for the underlying file storage type. Refer to the [Partition Filtering](#partitionfiltering) discussion later in this topic for additional information on partitioning and the `Hive` profile.
 
 ### <a id="profile_hive_using"></a>Example: Using the Hive Profile
@@ -361,7 +377,7 @@ Use the `Hive` profile to create a queryable HAWQ external table from the Hive `
 
 Use the `HiveText` profile to query text format files.
 
-**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
+**Note**: When using the `HiveText` profile, you **must** specify a delimiter option in both the `LOCATION` and `FORMAT` clauses.
 
 ### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
@@ -399,7 +415,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
 
 The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
 
-**Note**: When using the `HiveRC` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
+**Note**: When using the `HiveRC` profile, you **must** specify a delimiter option in both the `LOCATION` and `FORMAT` clauses.
 
 ### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
@@ -432,7 +448,7 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables.
     hive> SELECT * FROM sales_info_rcfile;
     ```
 
-4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
+4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. You *must* specify a delimiter option in both the `LOCATION` and `FORMAT` clauses:
 
    ``` sql
    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
@@ -611,76 +627,214 @@ postgres=# SHOW pxf_enable_filter_pushdown;
 postgres=# SET pxf_enable_filter_pushdown=off;
 ```
 
-### <a id="example2"></a>Create Partitioned Hive Table
+### <a id="example_hive_part"></a>Example: Using the Hive Profile to Access Partitioned Homogeneous Data
 
-Create a Hive table `sales_part` with two partition columns, `delivery_state` and `delivery_city:`
+In this example, you will use the `Hive` profile to query a Hive table named `sales_part` that you partition on the `delivery_state` and `delivery_city` fields. You will then create a HAWQ external table to query `sales_part`, including specific examples illustrating filter pushdown.
 
-``` sql
-hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
-        PARTITIONED BY (delivery_state string, delivery_city string)
-        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-```
+1. Create a Hive table named `sales_part` with two partition columns, `delivery_state` and `delivery_city`:
 
-Load data into this Hive table and add some partitions:
+    ``` sql
+    hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
+            PARTITIONED BY (delivery_state string, delivery_city string)
+            ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+    ```
 
-``` sql
-hive> INSERT INTO TABLE sales_part
-        PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno')
-        VALUES ('block', 'widget', 33, 15.17);
-hive> INSERT INTO TABLE sales_part
-        PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento')
-        VALUES ('cube', 'widget', 11, 1.17);
-hive> INSERT INTO TABLE sales_part
-        PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno')
-        VALUES ('dowel', 'widget', 51, 31.82);
-hive> INSERT INTO TABLE sales_part
-        PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas')
-        VALUES ('px49', 'pipe', 52, 99.82);
-```
+2. Load data into this Hive table and add some partitions:
 
-The Hive storage directory structure for the `sales_part` table appears as follows:
+    ``` sql
+    hive> INSERT INTO TABLE sales_part
+            PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno')
+            VALUES ('block', 'widget', 33, 15.17);
+    hive> INSERT INTO TABLE sales_part
+            PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento')
+            VALUES ('cube', 'widget', 11, 1.17);
+    hive> INSERT INTO TABLE sales_part
+            PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno')
+            VALUES ('dowel', 'widget', 51, 31.82);
+    hive> INSERT INTO TABLE sales_part
+            PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas')
+            VALUES ('px49', 'pipe', 52, 99.82);
+    ```
+
+3. Query the `sales_part` table:
 
-``` pre
-$ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
-/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=“Fresno”/
-/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/
-/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/
-/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=“Las Vegas”/
-```
+    ``` sql
+    hive> SELECT * FROM sales_part;
+    ```
+
+    A `SELECT *` statement on a Hive partitioned table shows the partition fields at the end of the record.
 
-To define a HAWQ PXF table that will read this Hive table and take advantage of partition filter push-down, define the fields corresponding to the Hive partition fields at the end of the `CREATE EXTERNAL TABLE` attribute list. In HiveQL, a `SELECT *` statement on a partitioned table shows the partition fields at the end of the record.
+4. Examine the Hive/HDFS directory structure for the `sales_part` table:
 
-``` sql
-postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
-             item_name TEXT,
-             item_type TEXT,
-             supplier_key INTEGER,
-             item_price DOUBLE PRECISION,
-             delivery_state TEXT,
-             delivery_city TEXT
-)
-LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-
-postgres=# SELECT * FROM pxf_sales_part;
-```
+    ``` shell
+    $ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
+    /apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Fresno/
+    /apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/
+    /apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/
+    /apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas/
+    ```
 
-### <a id="example3"></a>Query Without Pushdown
+5. Create a PXF external table to read the partitioned `sales_part` Hive table. To take advantage of partition filter push-down, define fields corresponding to the Hive partition fields at the end of the `CREATE EXTERNAL TABLE` attribute list.
 
-In the following example, the HAWQ query filters the `delivery_city` partition `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data on `Sacramento` is transferred for processing.
+    ``` shell
+    $ psql -d postgres
+    ```
 
-``` sql
-postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
-```
+    ``` sql
+    postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
+                 item_name TEXT, item_type TEXT,
+                 supplier_key INTEGER, item_price DOUBLE PRECISION,
+                 delivery_state TEXT, delivery_city TEXT)
+               LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
+               FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+    ```
+
+6. Query the table:
+
+    ``` sql
+    postgres=# SELECT * FROM pxf_sales_part;
+    ```
 
-### <a id="example4"></a>Query With Pushdown
+7. Perform another query (no pushdown) on `pxf_sales_part` to return records where the `delivery_city` is `Sacramento` and `item_name` is `cube`:
 
-The following HAWQ query reads all the data under `delivery_state` partition `CALIFORNIA`, regardless of the city.
+    ``` sql
+    postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
+    ```
 
-``` sql
-postgres=# SET pxf_enable_filter_pushdown=on;
-postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
-```
+    The query filters the `delivery_city` partition `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data in the `Sacramento` partition is transferred for processing.
+
+8. Query (with pushdown) for all records where `delivery_state` is `CALIFORNIA`:
+
+    ``` sql
+    postgres=# SET pxf_enable_filter_pushdown=on;
+    postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
+    ```
+
+    This query reads all of the data in the `CALIFORNIA` `delivery_state` partition, regardless of the city.
+
+### <a id="example_hive_part_multi"></a>Example: Using the Hive Profile to Access Partitioned Heterogeneous Data
+
+The `Hive` profile supports multiple data format types. This support enables you to query a partitioned Hive table that may be composed of data of different formats.
+
+In this example, you will use the `Hive` profile both directly and indirectly via PXF HCatalog integration to query a partitioned Hive external table. The table is composed of the HDFS data files associated with the `sales_info` (text format) and `sales_info_rcfile` (RC format) Hive tables you created in previous exercises. You will partition the data by year, assigning the data from `sales_info` to the year 2013, and the data from `sales_info_rcfile` to the year 2016. (For now, ignore the fact that the tables contain the same data.)
+
+1. Create a Hive external table named `hive_multiformpart` that is partitioned by a string field named `year`:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE EXTERNAL TABLE hive_multiformpart( location string, month string, number_of_orders int, total_sales double)
+            PARTITIONED BY( year string )
+            ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+    ```
+
+2. Describe the `sales_info` and `sales_info_rcfile` tables, making note of the HDFS file `location` of each:
+
+    ``` sql
+    hive> DESCRIBE EXTENDED sales_info;
+    hive> DESCRIBE EXTENDED sales_info_rcfile;
+    ```
+
+3. Create partitions in the `hive_multiformpart` table for the HDFS locations associated with each of the `sales_info` and `sales_info_rcfile` tables:
+
+    ``` sql
+    hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2013') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info';
+    hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2016') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info_rcfile';
+    ```
+
+4. Explicitly identify the file format of the partition associated with the `sales_info_rcfile` table:
+
+    ``` sql
+    hive> ALTER TABLE hive_multiformpart PARTITION (year='2016') SET FILEFORMAT RCFILE;
+    ```
+
+    You need not specify the file format of the partition associated with the `sales_info` table, as `TEXTFILE` format is the default.
+
+5. Query the `hive_multiformpart` table:
+
+    ``` sql
+    hive> SELECT * from hive_multiformpart;
+    ...
+    Bangalore Jul 271 8320.55 2016
+    Beijing Dec 100 4248.41 2016
+    Prague Jan 101 4875.33 2013
+    Rome Mar 87 1557.39 2013
+    ...
+    hive> SELECT * from hive_multiformpart WHERE year='2013';
+    hive> SELECT * from hive_multiformpart WHERE year='2016';
+    ```
+
+6. Show the partitions defined for the `hive_multiformpart` table and exit `hive`:
+
+    ``` sql
+    hive> SHOW PARTITIONS hive_multiformpart;
+    year=2013
+    year=2016
+    hive> quit;
+    ```
+
+7. Start the `psql` subsystem:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+8. Use PXF HCatalog integration to query the Hive `hive_multiformpart` external table you created in the previous steps:
+
+    ``` sql
+    postgres=# SELECT * FROM hcatalog.default.hive_multiformpart;
+    ```
+
+    ``` shell
+       location    | month | number_of_orders | total_sales | year
+    ---------------+-------+------------------+-------------+------
+    ...
+     Prague        | Dec   |              333 |     9894.77 | 2013
+     Bangalore     | Jul   |              271 |     8320.55 | 2013
+     Beijing       | Dec   |              100 |     4248.41 | 2013
+     Prague        | Jan   |              101 |     4875.33 | 2016
+     Rome          | Mar   |               87 |     1557.39 | 2016
+     Bangalore     | May   |              317 |     8936.99 | 2016
+    ...
+    ```
+
+9. Use the PXF `Hive` profile to create a readable HAWQ external table derived from the Hive `hive_multiformpart` external table you created in the previous steps:
+
+    ``` sql
+    postgres=# CREATE EXTERNAL TABLE pxf_multiformpart(location text, month text, num_orders int, total_sales float8, year text)
+                 LOCATION ('pxf://namenode:51200/default.hive_multiformpart?PROFILE=Hive')
+               FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+10. Query the PXF external table:
+
+    ``` sql
+    postgres=# SELECT * FROM pxf_multiformpart;
+    ```
+
+    ``` shell
+       location    | month | num_orders | total_sales | year
+    ---------------+-------+------------+-------------+--------
+    ....
+     Prague        | Dec   |        333 |     9894.77 | 2013
+     Bangalore     | Jul   |        271 |     8320.55 | 2013
+     Beijing       | Dec   |        100 |     4248.41 | 2013
+     Prague        | Jan   |        101 |     4875.33 | 2016
+     Rome          | Mar   |         87 |     1557.39 | 2016
+     Bangalore     | May   |        317 |     8936.99 | 2016
+    ....
+    ```
+
+11. Perform a second query to calculate the total number of orders for the month of `Dec` in the year 2013:
+
+    ``` sql
+    postgres=# SELECT sum(num_orders) FROM pxf_multiformpart WHERE month='Dec' AND year='2013';
+     sum
+    -----
+     433
+    ```
 
 ## <a id="topic_fdm_zrh_1s"></a>Using PXF with Hive Default Partitions
