Repository: incubator-hawq-docs Updated Branches: refs/heads/master 40d9a0b6d -> abea80853
HAWQ-1491 - create usage docs for HiveVectorizedORC profile (closes #126) Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/abea8085 Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/abea8085 Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/abea8085 Branch: refs/heads/master Commit: abea808538dc29e6eb893efc37c27fc2f59f460f Parents: 40d9a0b Author: Lisa Owen <[email protected]> Authored: Wed Jun 28 16:35:49 2017 -0700 Committer: David Yozie <[email protected]> Committed: Wed Jun 28 16:35:49 2017 -0700 ---------------------------------------------------------------------- markdown/pxf/HivePXF.html.md.erb | 82 ++++++++++++++++---- .../PXFExternalTableandAPIReference.html.md.erb | 20 +++-- markdown/pxf/ReadWritePXF.html.md.erb | 12 +++ 3 files changed, 90 insertions(+), 24 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/abea8085/markdown/pxf/HivePXF.html.md.erb ---------------------------------------------------------------------- diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb index a226537..c6c310e 100644 --- a/markdown/pxf/HivePXF.html.md.erb +++ b/markdown/pxf/HivePXF.html.md.erb @@ -50,7 +50,7 @@ The PXF Hive plug-in supports several file formats and profiles for accessing th | TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText | | SequenceFile | Flat file consisting of binary key/value pairs. | Hive | | RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC | -| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. 
| Hive, HiveORC | +| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive, HiveORC, HiveVectorizedORC | | Parquet | Compressed columnar data representation. | Hive | | Avro | JSON-defined, schema-based data serialization format. | Hive | @@ -78,12 +78,15 @@ The following table summarizes external mapping rules for Hive primitive types. | timestamp | timestamp | +**Note**: The `HiveVectorizedORC` profile does not support the timestamp data type. + ### <a id="topic_b4v_g3n_25"></a>Complex Data Types Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types. Examples using complex data types with the `Hive` and `HiveORC` profiles are provided later in this topic. +**Note**: The `HiveVectorizedORC` profile does not support complex types. ## <a id="hive_sampledataset"></a>Sample Data Set @@ -316,7 +319,7 @@ HCatalog integration has the following limitations: In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile. -The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`, and `HiveORC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `HiveORC` profile is optimized for ORC file formats. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats. +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, `HiveRC`, `HiveORC`, and `HiveVectorizedORC`.
The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `HiveORC` and `HiveVectorizedORC` profiles are optimized for ORC file formats. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats. Use the following syntax to create a HAWQ external table representing Hive data: @@ -324,7 +327,7 @@ Use the following syntax to create a HAWQ external table representing Hive data: CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> ) LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name> - ?PROFILE=Hive|HiveText|HiveRC|HiveORC[&DELIMITER=<delim>']) + ?PROFILE=Hive|HiveText|HiveRC|HiveORC|HiveVectorizedORC[&DELIMITER=<delim>]) FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>') ``` @@ -336,9 +339,9 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](.. | \<port\> | The PXF port. If \<port\> is omitted, PXF assumes \<host\> identifies a High Availability HDFS Nameservice and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. | | \<hive-db-name\> | The name of the Hive database. If omitted, defaults to the Hive database named `default`. | | \<hive-table-name\> | The name of the Hive table. | -| PROFILE | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, `HiveRC`, or `HiveORC`. | +| PROFILE | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, `HiveRC`, `HiveORC`, or `HiveVectorizedORC`. | | DELIMITER | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set. \<delim\> must be a single ascii character or specified in hexadecimal representation.
| -| FORMAT (`Hive` and `HiveORC` profiles) | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`. | +| FORMAT (`Hive`, `HiveORC`, and `HiveVectorizedORC` profiles) | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`. | | FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. | @@ -348,9 +351,9 @@ Use the `Hive` profile with any Hive file storage format. With the `Hive` profil ### <a id="profile_hive_using"></a>Example: Using the Hive Profile -Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier. +Use the `Hive` profile to create a readable HAWQ external table from the Hive `sales_info` textfile format table created earlier. -1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier: +1. Create a readable HAWQ external table from the Hive `sales_info` textfile format table created earlier: ``` sql postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8) @@ -382,7 +385,7 @@ Use the `HiveText` profile to query text format files. ### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile -Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier. +Use the PXF `HiveText` profile to create a readable HAWQ external table from the Hive `sales_info` textfile format table created earlier. 1. Create the external table: @@ -449,7 +452,7 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. hive> SELECT * FROM sales_info_rcfile; ``` -4. 
Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. You *must* specify a delimiter option in both the `LOCATION` and `FORMAT` clauses.: +4. Use the PXF `HiveRC` profile to create a readable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. You *must* specify a delimiter option in both the `LOCATION` and `FORMAT` clauses: ``` sql postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8) @@ -475,7 +478,7 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. ... ``` -## <a id="hiveorc-intro"></a>HiveORC Profile +## <a id="hiveorc-intro"></a>ORC File Format The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC format offers improvements over text and RCFile formats in terms of both compression and performance. HAWQ/PXF supports ORC version 1.2.1. @@ -485,7 +488,9 @@ ORC also supports predicate pushdown with built-in indexes at the file, stripe, Refer to the [Apache orc](https://orc.apache.org/docs/) and the Apache Hive [LanguageManual ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) websites for detailed information about the ORC file format. -Use the `HiveORC` profile to access ORC format data. The `HiveORC` profile provides: +### <a id="orc-profiles"></a>Profiles Supporting the ORC File Format + +Use the `HiveORC` or `HiveVectorizedORC` profiles to access ORC format data. These profiles provide: - Enhanced query performance - Column projection information is leveraged to enhance query performance by reducing disk I/O and data payload. @@ -495,9 +500,16 @@ Use the `HiveORC` profile to access ORC format data.
The `HiveORC` profile provi - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the `float8` and `float4` types - `IN` operator on arrays of `int2`, `int4`, `int8`, `boolean`, and `text` -- Complex type support - You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to `text`. +When choosing an ORC-supporting profile, consider the following: + +- The `HiveORC` profile supports complex types. You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to `text`. + + The `HiveVectorizedORC` profile does not support complex types. + +- The `HiveVectorizedORC` profile reads up to 1024 rows of data at once, while the `HiveORC` profile reads only a single row at a time. -**Note**: The `HiveORC` profile currently supports access to data stored in ORC format only through a Hive mapped table. + +**Note**: The `HiveORC` and `HiveVectorizedORC` profiles currently support access to data stored in ORC format only through a Hive mapped table. ### <a id="using-hiveorc-profile"></a>Example: Using the HiveORC Profile @@ -540,7 +552,7 @@ In the following example, you will create a Hive table stored in ORC format and Timing is on. ``` -4. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `sales_info_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`. +4. Use the PXF `HiveORC` profile to create a readable HAWQ external table from the Hive table named `sales_info_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`. 
``` sql postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8) @@ -565,6 +577,44 @@ In the following example, you will create a Hive table stored in ORC format and Time: 425.416 ms ``` +### <a id="using-hivevectorizedorc-profile"></a>Example: Using the HiveVectorizedORC Profile + +In the following example, you will use the `HiveVectorizedORC` profile to query the `sales_info_ORC` Hive table you created in the previous example. + +**Note**: The `HiveVectorizedORC` profile does not support the timestamp data type or complex types. + +1. Start the `psql` subsystem: + + ``` shell + $ psql -d postgres + ``` + +2. Use the PXF `HiveVectorizedORC` profile to create a readable HAWQ external table from the Hive table named `sales_info_ORC` that you created in Step 1 of the previous example. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveVectorizedORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`. + + ``` sql + postgres=> CREATE EXTERNAL TABLE salesinfo_hiveVectORC(location text, month text, num_orders int, total_sales float8) + LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveVectorizedORC') + FORMAT 'CUSTOM' (formatter='pxfwritable_import'); + ``` + +3. Query the external table: + + ``` sql + postgres=> SELECT * FROM salesinfo_hiveVectORC; + ``` + + ``` pre + location | month | num_orders | total_sales + ---------------+-------+------------+------------- + Prague | Jan | 101 | 4875.33 + Rome | Mar | 87 | 1557.39 + Bangalore | May | 317 | 8936.99 + ... + + Time: 425.416 ms + ``` + + ## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types.
For example, if a Hive table is created using: @@ -664,7 +714,7 @@ When specifying an array field in a Hive table, you must identify the terminator ... ``` -6. Use the PXF `Hive` profile to create a queryable HAWQ external table representing the Hive `table_complextypes`: +6. Use the PXF `Hive` profile to create a readable HAWQ external table representing the Hive `table_complextypes`: ``` sql postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int, name text, intarray text, propmap text) @@ -744,7 +794,7 @@ In the following example, you will create a Hive table stored in ORC format. You $ psql -d postgres ``` -4. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `table_complextypes_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`. +4. Use the PXF `HiveORC` profile to create a readable HAWQ external table from the Hive table named `table_complextypes_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`. ``` sql postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text) http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/abea8085/markdown/pxf/PXFExternalTableandAPIReference.html.md.erb ---------------------------------------------------------------------- diff --git a/markdown/pxf/PXFExternalTableandAPIReference.html.md.erb b/markdown/pxf/PXFExternalTableandAPIReference.html.md.erb index 096d41d..4ec1b72 100644 --- a/markdown/pxf/PXFExternalTableandAPIReference.html.md.erb +++ b/markdown/pxf/PXFExternalTableandAPIReference.html.md.erb @@ -58,7 +58,7 @@ FORMAT 'custom' (formatter='pxfwritable_import|pxfwritable_export'); | \<host\> | The PXF host. 
While \<host\> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, \<host\> must identify the HDFS NameService. | | \<port\> | The PXF port. If \<port\> is omitted, PXF assumes \<host\> identifies a High Availability HDFS Nameservice and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. | | \<path\-to\-data\> | A directory, file name, wildcard pattern, table name, etc. | -| PROFILE | The profile PXF uses to access the data. PXF supports multiple plug-ins that currently expose profiles named `HBase`, `Hive`, `HiveRC`, `HiveText`, `HiveORC`, `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, `SequenceWritable`, and `Json`. | +| PROFILE | The profile PXF uses to access the data. PXF supports multiple plug-ins that currently expose profiles named `HBase`, `Hive`, `HiveRC`, `HiveText`, `HiveORC`, `HiveVectorizedORC`, `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, `SequenceWritable`, and `Json`. | | FRAGMENTER | The Java class the plug-in uses for fragmenting data. Used for READABLE external tables only. | | ACCESSOR | The Java class the plug-in uses for accessing the data. Used for READABLE and WRITABLE tables. | | RESOLVER | The Java class the plug-in uses for serializing and deserializing the data. Used for READABLE and WRITABLE tables. 
| @@ -504,6 +504,10 @@ The `Accessor` retrieves specific fragments and passes records back to the Resol <td>org.apache.hawq.pxf.plugins.hive.HiveORCAccessor</td> <td>Accessor for Hive tables stored as ORC format </td> </tr> +<tr class="even"> +<td>org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor</td> +<td>Vectorized accessor for Hive tables stored as ORC format </td> +</tr> </tr> <tr class="odd"> <td>org.apache.hawq.pxf.plugins.json.JsonAccessor</td> @@ -664,10 +668,14 @@ DataType.TIMESTAMP</code></pre></td> <td>org.apache.hawq.pxf.plugins.hive.HiveColumnarSerdeResolver</td> <td>Specialized <code class="ph codeph">HiveResolver</code> for a Hive table stored as RC file. Should be used together with <code class="ph codeph">HiveInputFormatFragmenter</code>/<code class="ph codeph">HiveRCFileAccessor</code>.</td> </tr> -<tr class="odd"> +<tr class="even"> <td>org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver</td> <td>Specialized <code class="ph codeph">HiveResolver</code> for a Hive table stored in ORC format. Should be used together with <code class="ph codeph">HiveInputFormatFragmenter</code>/<code class="ph codeph">HiveORCAccessor</code>.</td> </tr> +<tr class="odd"> +<td>org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedResolver</td> +<td>Specialized <code class="ph codeph">HiveResolver</code> for a Hive table stored in ORC format.
Should be used together with <code class="ph codeph">HiveInputFormatFragmenter</code>/<code class="ph codeph">HiveORCVectorizedAccessor</code>.</td> +</tr> </tbody> </table> @@ -1103,12 +1111,8 @@ String colName = filterColumn.columnName(); This section contains the following information: -- [External Table Examples](#externaltableexamples) -- [Plug-in Examples](#pluginexamples) - -- **[External Table Examples](../pxf/PXFExternalTableandAPIReference.html#externaltableexamples)** - -- **[Plug-in Examples](../pxf/PXFExternalTableandAPIReference.html#pluginexamples)** +- **[External Table Examples](#externaltableexamples)** +- **[Plug-in Examples](#pluginexamples)** ### <a id="externaltableexamples"></a>External Table Examples http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/abea8085/markdown/pxf/ReadWritePXF.html.md.erb ---------------------------------------------------------------------- diff --git a/markdown/pxf/ReadWritePXF.html.md.erb b/markdown/pxf/ReadWritePXF.html.md.erb index 5c29ae8..0173ca6 100644 --- a/markdown/pxf/ReadWritePXF.html.md.erb +++ b/markdown/pxf/ReadWritePXF.html.md.erb @@ -105,6 +105,18 @@ Note: The <code class="ph codeph">DELIMITER</code> parameter is mandatory. <li>org.apache.hawq.pxf.service.io.GPDBWritable</li> </ul></td> </tr> +<tr class="even"> +<td>HiveVectorizedORC</td> +<td>Optimized bulk/batch read of a Hive table where each partition is stored as an ORC file. +</td> +<td><ul> +<li>org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter</li> +<li>org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor</li> +<li>org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedResolver</li> +<li>org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher</li> +<li>org.apache.hawq.pxf.service.io.GPDBWritable</li> +</ul></td> +</tr> <tr class="odd"> <td>HiveText</td> <td>Optimized read of a Hive table where each partition is stored as a text file.
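The `HiveVectorizedORC` profile definition added above bundles a fragmenter, an accessor, and a resolver class. As documented in `PXFExternalTableandAPIReference.html.md.erb`, a PXF external table can also name these classes explicitly in the `LOCATION` URI in place of a `PROFILE`. The following sketch is illustrative only and is not part of this commit; the `namenode` host, port, and `default.sales_info_ORC` table follow the earlier examples, and the external table name `salesinfo_vectorc_byclass` is hypothetical:

``` sql
-- Hypothetical equivalent of PROFILE=HiveVectorizedORC: name the profile's
-- fragmenter, accessor, and resolver classes directly in the LOCATION URI.
postgres=> CREATE EXTERNAL TABLE salesinfo_vectorc_byclass(location text, month text, num_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info_ORC?FRAGMENTER=org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedResolver')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```

When the classes are spelled out this way, keep them consistent with one another: per the resolver table in this commit, `HiveORCVectorizedResolver` should be used together with `HiveInputFormatFragmenter` and `HiveORCVectorizedAccessor`.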