Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop ce67adabd -> 8c37d043a


HAWQ-1394 document HiveORC profile (closes #104)


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/8c37d043
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/8c37d043
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/8c37d043

Branch: refs/heads/develop
Commit: 8c37d043ab8b6ef59315a7bd92acb2c9b45dcedb
Parents: ce67ada
Author: Lisa Owen <[email protected]>
Authored: Mon Mar 20 14:19:45 2017 -0700
Committer: David Yozie <[email protected]>
Committed: Mon Mar 20 14:19:45 2017 -0700

----------------------------------------------------------------------
 markdown/pxf/HivePXF.html.md.erb | 183 ++++++++++++++++++++++++++++++++--
 1 file changed, 175 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/8c37d043/markdown/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb
index bc4e9f6..345bbca 100644
--- a/markdown/pxf/HivePXF.html.md.erb
+++ b/markdown/pxf/HivePXF.html.md.erb
@@ -50,7 +50,7 @@ The PXF Hive plug-in supports several file formats and 
profiles for accessing th
 | TextFile | Flat file with data in comma-, tab-, or space-separated value 
format or JSON notation. | Hive, HiveText |
 | SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
 | RCFile | Record columnar data consisting of binary key/value pairs; high row 
compression rate. | Hive, HiveRC |
-| ORCFile | Optimized row columnar data with stripe, footer, and postscript 
sections; reduces data size. | Hive |
+| ORCFile | Optimized row columnar data with stripe, footer, and postscript 
sections; reduces data size. | Hive, HiveORC |
 | Parquet | Compressed columnar data representation. | Hive |
 | Avro | JSON-defined, schema-based data serialization format. | Hive |
 
@@ -82,7 +82,7 @@ The following table summarizes external mapping rules for 
Hive primitive types.
 
 Hive supports complex data types including array, struct, map, and union. PXF 
maps each of these complex types to `text`.  While HAWQ does not natively 
support these types, you can create HAWQ functions or application code to 
extract subcomponents of these complex data types.
 
-An example using complex data types with the `Hive` profile is provided later 
in this topic.
+Examples using complex data types with the `Hive` and `HiveORC` profiles are 
provided later in this topic.
 
 
 ## <a id="hive_sampledataset"></a>Sample Data Set
@@ -316,7 +316,7 @@ HCatalog integration has the following limitations:
 
 In the previous section, you used HCatalog integration to query a Hive table. 
You can also create a PXF/HAWQ external table to access Hive table data. This 
Hive table access mechanism requires that you identify an appropriate Hive 
profile.
 
-The PXF Hive plug-in supports several Hive-related profiles. These include 
`Hive`, `HiveText`, and `HiveRC`. The `HiveText` and `HiveRC` profiles are 
specifically optimized for text and RC file formats, respectively. The `Hive` 
profile is optimized for all file storage types; use the `Hive` profile when 
the underlying Hive table is composed of multiple partitions with differing 
file formats.
+The PXF Hive plug-in supports several Hive-related profiles: `Hive`, 
`HiveText`, `HiveRC`, and `HiveORC`. The `HiveText` and `HiveRC` profiles are 
specifically optimized for the text and RC file formats, respectively. The 
`HiveORC` profile is optimized for the ORC file format. The `Hive` profile is 
optimized for all file storage types; use the `Hive` profile when the 
underlying Hive table is composed of multiple partitions with differing file 
formats.
 
 Use the following syntax to create a HAWQ external table representing Hive 
data:
 
@@ -324,7 +324,7 @@ Use the following syntax to create a HAWQ external table 
representing Hive data:
 CREATE EXTERNAL TABLE <table_name>
     ( <column_name> <data_type> [, ...] | LIKE <other_table> )
 LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
-    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
+    ?PROFILE=Hive|HiveText|HiveRC|HiveORC[&DELIMITER=<delim>'])
 FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
 ```
 
@@ -336,9 +336,9 @@ Hive-plug-in-specific keywords and values used in the 
[CREATE EXTERNAL TABLE](..
 | \<port\>    | The PXF port. If \<port\> is omitted, PXF assumes \<host\> 
identifies a High Availability HDFS Nameservice and connects to the port number 
designated by the `pxf_service_port` server configuration parameter value. 
Default is 51200. |
 | \<hive-db-name\>    | The name of the Hive database. If omitted, defaults to 
the Hive database named `default`. |
 | \<hive-table-name\>    | The name of the Hive table. |
-| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, 
`HiveText`, or `HiveRC`. |
+| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, 
`HiveText`, `HiveRC`, or `HiveORC`. |
 | DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` 
and `HiveRC` profiles and identifies the field delimiter used in the Hive data 
set. \<delim\> must be a single ASCII character or specified in hexadecimal 
representation. |
-| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The 
`CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
+| FORMAT (`Hive` and `HiveORC` profiles)   | The `FORMAT` clause must specify 
`CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` 
`formatter`.   |
 | FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify 
`TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
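
As an illustration of the hexadecimal delimiter form described above: assuming the comma-delimited sample `sales_info` Hive table used elsewhere in this topic, the comma can be supplied as `\x2c` both in the `LOCATION` URI and in the `FORMAT` clause. The table and host names here are illustrative:

``` sql
-- The delimiter appears twice: once for PXF (DELIMITER=\x2c in the URI)
-- and once for the HAWQ TEXT formatter (delimiter=E'\x2c').
CREATE EXTERNAL TABLE salesinfo_hivetext_hex(location text, month text,
    num_orders int, total_sales float8)
LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
FORMAT 'TEXT' (delimiter=E'\x2c');
```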
 
 
@@ -401,7 +401,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ 
external table from th
 2. Query the external table:
 
     ``` sql
-    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE 
location="Beijing";
+    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE 
location='Beijing';
     ```
 
     ``` shell
@@ -475,6 +475,96 @@ Use the `HiveRC` profile to query RCFile-formatted data in 
Hive tables.
      ...
     ```
 
+## <a id="hiveorc-intro"></a>HiveORC Profile
+
+The Optimized Row Columnar (ORC) file format is a columnar file format that 
provides a highly efficient way to both store and access HDFS data. ORC format 
offers improvements over text and RCFile formats in terms of both compression 
and performance. HAWQ/PXF supports ORC version 1.2.1.
+
+ORC is type-aware and specifically designed for Hadoop workloads. ORC files 
store both the data type and the encoding information for the data in the 
file. All columns within a single group of row data (also known as a stripe) 
are stored together on disk in ORC format files. The columnar nature of the 
ORC format enables read projection, helping avoid accessing unnecessary 
columns during a query.
+
+ORC also supports predicate pushdown with built-in indexes at the file, 
stripe, and row levels, moving the filter operation to the data loading phase.
+
+Refer to the [Apache ORC](https://orc.apache.org/docs/) and the Apache Hive 
[LanguageManual 
ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) 
websites for detailed information about the ORC file format.
+
+Use the `HiveORC` profile to access ORC format data. The `HiveORC` profile 
provides:
+
+- Enhanced query performance - Column projection information is leveraged to 
enhance query performance by reducing disk I/O and data payload.
+
+- Optimized predicate pushdown - Predicate pushdown is optimized for:
+    - `int2`, `int4`, `int8`, `float8`, `text`, `bpchar`, and `boolean` data 
type and `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operator 
combinations
+    - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and 
comparisons between the integer types
+    - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and 
comparisons between the `float8` and `float4` types
+    - `IN` operator on arrays of `int2`, `int4`, `int8`, `boolean`, and `text`
+
+- Complex type support - You can access Hive tables composed of array, map, 
struct, and union data types. PXF serializes each of these complex types to 
`text`.
+
+**Note**: The `HiveORC` profile currently supports access to data stored in 
ORC format only through a Hive mapped table.
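
As a sketch of how the predicate pushdown combinations above apply: in a query like the following, run against the `salesinfo_hiveORCprofile` external table defined later in this topic, both filters use supported type/operator pairs (`int4` with `>`, `text` with `=`), so they are eligible to be pushed into the ORC reader rather than applied after the rows are loaded:

``` sql
-- Both predicates use supported type/operator combinations, so PXF can
-- evaluate them during the data loading phase using the ORC indexes.
SELECT location, total_sales
FROM   salesinfo_hiveORCprofile
WHERE  num_orders > 100 AND location = 'Prague';
```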
+
+### <a id="using-hiveorc-profile"></a>Example: Using the HiveORC Profile
+
+In the following example, you will create a Hive table stored in ORC format 
and use the `HiveORC` profile to query this Hive table.
+
+1. Create a Hive table with ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE sales_info_ORC (location string, month string,
+            number_of_orders int, total_sales double)
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `sales_info` table into `sales_info_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in 
`sales_info_ORC`.
+
+3. Perform a Hive query on `sales_info_ORC` to verify that the data was loaded 
successfully:
+
+    ``` sql
+    hive> SELECT * FROM sales_info_ORC;
+    ```
+
+4. Start the `psql` subsystem and turn on timing:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+    ``` sql
+    postgres=> \timing
+    Timing is on.
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table 
from the Hive table named `sales_info_ORC` that you created in Step 1. The 
`FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format 
supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, 
month text, num_orders int, total_sales float8)
+                 LOCATION 
('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM salesinfo_hiveORCprofile;
+    ```
+
+    ``` pre
+       location    | month | num_orders | total_sales 
+    ---------------+-------+------------+-------------
+     Prague        | Jan   |        101 |     4875.33
+     Rome          | Mar   |         87 |     1557.39
+     Bangalore     | May   |        317 |     8936.99
+     ...
+
+    Time: 425.416 ms
+    ```
+
 ## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
 
 The PXF `Hive` profile supports both non-partitioned and partitioned Hive 
tables that use the Parquet storage format in HDFS. Simply map the table 
columns using equivalent HAWQ data types. For example, if a Hive table is 
created using:
@@ -498,8 +588,9 @@ And query the HAWQ external table using:
 postgres=# SELECT fname,lname FROM pxf_parquet_table;
 ```
 
+## <a id="complex_dt_example"></a> Complex Data Types
 
-## <a id="complex_dt_example"></a>Hive Profile Complex Data Type Example
+### <a id="complex_dt_example_hive"></a>Example: Using the Hive Profile with 
Complex Data Types
 
 This example employs the `Hive` profile and the array and map complex types, 
specifically an array of integers and a string key/value pair map.
 
@@ -607,6 +698,82 @@ When specifying an array field in a Hive table, you must 
identify the terminator
 
     `intarray` and `propmap` are each serialized as text strings.
 
+### <a id="using-hiveorc-profile-complex"></a>Example: Using the HiveORC 
Profile with Complex Data Types
+
+In the following example, you will create a Hive table stored in ORC format, 
populate it from the `table_complextypes` Hive table you created in the 
previous exercise, and then use the `HiveORC` profile to query its complex 
type columns.
+
+1. Create a Hive table with ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE table_complextypes_ORC( index int, name string, 
intarray ARRAY<int>, propmap MAP<string, string>)
+            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+            COLLECTION ITEMS TERMINATED BY '%'
+            MAP KEYS TERMINATED BY ':'
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `table_complextypes` table into 
`table_complextypes_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM 
table_complextypes;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in 
`table_complextypes_ORC`.
+
+3. Perform a Hive query on `table_complextypes_ORC` to verify that the data 
was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM table_complextypes_ORC;
+    ```
+
+    ``` pre
+    OK
+    3       Prague       [1,2,3]    {"zone":"euro","status":"up"}
+    89      Rome         [4,5,6]    {"zone":"euro"}
+    400     Bangalore    [7,8,9]    {"zone":"apac","status":"pending"}
+    ...
+    ```
+
+4. Start the `psql` subsystem:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table 
from the Hive table named `table_complextypes_ORC` that you created in Step 1. 
The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format 
supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name 
text, intarray text, propmap text)
+                 LOCATION 
('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+    Notice that the integer array and map complex types are mapped to type 
`text`.
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM complextypes_hiveorc;
+    ```
+
+    ``` pre
+     index |    name    | intarray |              propmap               
+    -------+------------+----------+------------------------------------
+         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
+        89 | Rome       | [4,5,6]  | {"zone":"euro"}
+       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
+     ...
+
+    ```
+
+    `intarray` and `propmap` are each serialized as text strings.
+
+
+
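Because PXF returns these complex fields as `text`, you can extract subcomponents with ordinary HAWQ string functions. The following is a minimal sketch against the `complextypes_hiveorc` table above; the parsing assumes the bracketed, comma-separated serialization shown in the sample output:

``` sql
-- Extract the first element of the serialized integer array: strip the
-- surrounding brackets, split on commas, and cast the result to int.
SELECT name,
       split_part(trim(both '[]' from intarray), ',', 1)::int AS first_element
FROM complextypes_hiveorc;
```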
 ## <a id="partitionfiltering"></a>Partition Filtering
 
 The PXF Hive plug-in supports the Hive partitioning feature and directory 
structure. This enables partition exclusion on selected HDFS files comprising 
the Hive table. To use the partition filtering feature to reduce network 
traffic and I/O, run a PXF query using a `WHERE` clause that refers to a 
specific partition in the partitioned Hive table.
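
For example, assuming a hypothetical external table `pxf_sales_part` mapped to a Hive table partitioned on a `delivery_state` column, a query of the following form restricts the scan to a single partition:

``` sql
-- The WHERE clause names one partition value, so PXF can exclude the
-- HDFS directories belonging to all other partitions of the Hive table.
SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
```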
