[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-hawq-docs/pull/39

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
Github user lisakowen commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85424776

--- Diff: pxf/HivePXF.html.md.erb ---

@@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
              delivery_state TEXT,
              delivery_city TEXT
 )
-LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
+LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
 FORMAT 'custom' (FORMATTER='pxfwritable_import');
 postgres=# SELECT * FROM pxf_sales_part;
 ```
 
-### Example
+### Query Without Pushdown
 
 In the following example, the HAWQ query filters the `delivery_city` partition `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data on `Sacramento` is transferred for processing.
 
 ``` sql
-postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';
+postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
 ```
 
-### Example
+### Query With Pushdown

--- End diff --

will also need to add this GUC to the documentation.
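As a companion to the "Query With Pushdown" heading in this hunk: when every filter references a partition column, PXF can push the whole predicate down to the Hive side. The query below is a sketch based on the `pxf_sales_part` table shown in the diff; it is an assumption, not text from the PR:

``` sql
-- delivery_state and delivery_city are partition columns of sales_part,
-- so a filter that uses only these columns can be pushed down to Hive
postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND delivery_state = 'CA';
```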
Github user lisakowen commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85408122

--- Diff: pxf/HivePXF.html.md.erb ---

@@ -2,121 +2,450 @@ title: Accessing Hive Data
 ---
 
-This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
+Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and Parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
+
+This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
+
+- Creating an external table in PXF and querying that table
+- Querying Hive tables via PXF's integration with HCatalog
 
 ## Prerequisites
 
-Check the following before using PXF to access Hive:
+Before accessing Hive data with HAWQ and PXF, ensure that:
 
-- The PXF HDFS plug-in is installed on all cluster nodes.
+- The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
 - The PXF Hive plug-in is installed on all cluster nodes.
 - The Hive JAR files and conf directory are installed on all cluster nodes.
-- Test PXF on HDFS before connecting to Hive or HBase.
+- You have tested PXF on HDFS.
 - You are running the Hive Metastore service on a machine in your cluster.
 - You have set the `hive.metastore.uris` property in the `hive-site.xml` on the NameNode.
+
+## Hive File Formats
+
+Hive supports several file formats:
+
+- TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
+- SequenceFile - flat file consisting of binary key/value pairs
+- RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
+- ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
+- Parquet - compressed columnar data representation
+- Avro - JSON-defined, schema-based data serialization format
+
+Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
+
+The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above:
+
+- `Hive`
+- `HiveText`
+- `HiveRC`
+
+## Data Type Mapping
+
+### Primitive Data Types
+
+To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
+
+The following table summarizes external mapping rules for Hive primitive types.
+
+| Hive Data Type | HAWQ Data Type |
+|----------------|----------------|
+| boolean        | bool           |
+| int            | int4           |
+| smallint       | int2           |
+| tinyint        | int2           |
+| bigint         | int8           |
+| decimal        | numeric        |
+| float          | float4         |
+| double         | float8         |
+| string         | text           |
+| binary         | bytea          |
+| char           | bpchar         |
+| varchar        | varchar        |
+| timestamp      | timestamp      |
+| date           | date           |
+
+### Complex Data Types
+
+Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
+
+An example using complex data types is provided later in this topic.
+
+## Sample Data Set
+
+Examples used in this topic operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
+
+- location - text
+- month - text
+- number\_of\_orders - integer
+- total\_sales - double
+
+Prepare the sample data set for use:
+
+1. Create a text file:
+
+``` shell
+$ vi /tmp/pxf_hive_datafile.txt
+```
+
+2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
+
+```
+Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67
+San Francisco,Sept,156,6846.34
+Paris,Nov,159,7134.56
+San Francisco,Jan,113,5397.89
+Prague,Dec,333,9894.77
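For readers following along with the quoted diff, the sample file could be registered as a Hive table roughly as follows. This is a sketch only: the HiveQL statements are assumptions and not part of the PR text (the table name `sales_info` is borrowed from a later hunk in the same review):

``` sql
hive> CREATE TABLE sales_info (location string, month string,
        number_of_orders int, total_sales double)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS textfile;
hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt' INTO TABLE sales_info;
```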
Github user lisakowen commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85407700
Github user lisakowen commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85407620

yes, this is good info to share with the user! i checked out the code, and it looks like this GUC is on by default. i will add some text to that effect in the appropriate section.
Github user lisakowen commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85403929
Github user kavinderd commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85378363

Somewhere it should be stated that the HAWQ GUC `pxf_enable_filter_pushdown` needs to be turned on. If this is off no filter pushdown will occur regardless of the nature of the query.
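For context, `pxf_enable_filter_pushdown` is a HAWQ server configuration parameter, so it can be inspected and toggled per session with the standard `SHOW`/`SET` commands. A minimal sketch (the GUC name comes from the comment above; the session-setting syntax is the standard HAWQ/PostgreSQL one):

``` sql
postgres=# SHOW pxf_enable_filter_pushdown;
postgres=# SET pxf_enable_filter_pushdown = on;
```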
Github user kavinderd commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85376355
Github user kavinderd commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85376289
Github user dyozie commented on a diff in the pull request: https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367943
[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85371576

--- Diff: pxf/HivePXF.html.md.erb ---
@@ -151,184 +477,120 @@ To enable HCatalog query integration in HAWQ, perform the following steps:
 postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
 ```
-3. To query a Hive table with HCatalog integration, simply query HCatalog directly from HAWQ. The query syntax is:
-``` sql
-postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
-```
+To query a Hive table with HCatalog integration, query HCatalog directly from HAWQ. The query syntax is:
+
+``` sql
+postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
+```
-For example:
+For example:
-``` sql
-postgres=# SELECT * FROM hcatalog.default.sales;
-```
-
-4. To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
-- Within HAWQ, use either the `\d hcatalog.hive-db-name.hive-table-name` or `\d+ hcatalog.hive-db-name.hive-table-name` commands to describe a single table. For example, from the `psql` client interface:
-
-``` shell
-$ psql -d postgres
-postgres=# \d hcatalog.default.test
-
-PXF Hive Table "default.test"
-    Column    | Type
---------------+------
- name         | text
- type         | text
- supplier_key | int4
- full_price   | float8
-```
-- Use `\d hcatalog.hive-db-name.*` to describe the whole database schema. For example:
-
-``` shell
-postgres=# \d hcatalog.default.*
-
-PXF Hive Table "default.test"
-    Column    | Type
---------------+------
- type         | text
- name         | text
- supplier_key | int4
- full_price   | float8
-
-PXF Hive Table "default.testabc"
- Column | Type
---------+------
- type   | text
- name   | text
-```
-- Use `\d hcatalog.*.*` to describe the whole schema:
-
-``` shell
-postgres=# \d hcatalog.*.*
-
-PXF Hive Table "default.test"
-    Column    | Type
---------------+------
- type         | text
- name         | text
- supplier_key | int4
- full_price   | float8
-
-PXF Hive Table "default.testabc"
- Column | Type
---------+------
- type   | text
- name   | text
-
-PXF Hive Table "userdb.test"
-  Column  | Type
-----------+------
- address  | text
- username | text
-
-```
-
-**Note:** When using `\d` or `\d+` commands in the `psql` HAWQ client, `hcatalog` will not be listed as a database. If you use other `psql` compatible clients, `hcatalog` will be listed as a database with a size value of `-1` since `hcatalog` is not a real database in HAWQ.
-
-5. Alternatively, you can use the **pxf\_get\_item\_fields** user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters.
-
-**Note:** Currently the only supported input profile is `'Hive'`.
-
-For example, the following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, and fieldtype.
+``` sql
+postgres=# SELECT * FROM hcatalog.default.sales_info;
+```
+
+To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
+
+- Within HAWQ, use either the `\d hcatalog.hive-db-name.hive-table-name` or `\d+ hcatalog.hive-db-name.hive-table-name` commands to describe a single table. For example, from the `psql` client interface:
+
+``` shell
+$ psql -d postgres
+```
 ``` sql
-postgres=# select * from pxf_get_item_fields('Hive','default.test');
+postgres=# \d hcatalog.default.sales_info_rcfile;
 ```
-
-``` pre
-    path    | itemname |  fieldname   | fieldtype
-------------+----------+--------------+-----------
- default    | test     | name         | text
- default    | test     | type         | text
- default    | test     | supplier_key | int4
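The `fieldtype` values in the describe output quoted above (`text`, `int4`, `float8`, and so on) follow the Hive-to-HAWQ mapping table earlier in this topic. A minimal sketch of that mapping — an illustrative lookup, not PXF's actual implementation:

```python
# Hive primitive type -> HAWQ type, per the mapping table in this topic.
# Illustrative only; PXF performs this mapping internally.
HIVE_TO_HAWQ = {
    "boolean": "bool", "int": "int4", "smallint": "int2", "tinyint": "int2",
    "bigint": "int8", "decimal": "numeric", "float": "float4",
    "double": "float8", "string": "text", "binary": "bytea",
    "char": "bpchar", "varchar": "varchar",
    "timestamp": "timestamp", "date": "date",
}

# Complex types (array, struct, map, union) are all mapped to text.
COMPLEX_TYPES = {"array", "struct", "map", "union", "uniontype"}

def hawq_type(hive_type: str) -> str:
    """Return the HAWQ type a Hive column type maps to."""
    base = hive_type.split("<")[0].strip().lower()  # e.g. array<int> -> array
    if base in COMPLEX_TYPES:
        return "text"
    return HIVE_TO_HAWQ[base]

# The "default.test" description shown in the quoted diff:
for col, htype in [("name", "string"), ("type", "string"),
                   ("supplier_key", "int"), ("full_price", "double")]:
    print(col, "|", hawq_type(htype))
```

Applied to the sample schema, this reproduces the `text`/`int4`/`float8` column types shown in the `\d` output.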
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85370681

--- Diff: pxf/HivePXF.html.md.erb ---
@@ -151,184 +477,120 @@ To enable HCatalog query integration in HAWQ, perform the following steps:
 postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
 ```
-3. To query a Hive table with HCatalog integration, simply query HCatalog directly from HAWQ. The query syntax is:
-``` sql
-postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
-```
+To query a Hive table with HCatalog integration, query HCatalog directly from HAWQ. The query syntax is:
--- End diff --

It's a bit awkward to drop out of the procedure and into a free-form discussion of the various operations. It might be better to put the previous three-step procedure into a new subsection such as "Enabling HCatalog Integration" and then put the remaining non-procedural content into "Usage".
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85368752

--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@ title: Accessing Hive Data
 ---
-This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
+Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
+
+This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
+
+- Creating an external table in PXF and querying that table
+- Querying Hive tables via PXF's integration with HCatalog
 
 ## Prerequisites
 
-Check the following before using PXF to access Hive:
+Before accessing Hive data with HAWQ and PXF, ensure that:
 
-- The PXF HDFS plug-in is installed on all cluster nodes.
+- The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
 - The PXF Hive plug-in is installed on all cluster nodes.
 - The Hive JAR files and conf directory are installed on all cluster nodes.
-- Test PXF on HDFS before connecting to Hive or HBase.
+- You have tested PXF on HDFS.
 - You are running the Hive Metastore service on a machine in your cluster.
 - You have set the `hive.metastore.uris` property in the `hive-site.xml` on the NameNode.
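One prerequisite quoted above is that `hive.metastore.uris` is set in `hive-site.xml` on the NameNode. A rough sanity-check sketch for that property — the file path, helper name, and thrift URI below are illustrative, not taken from the docs:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def metastore_uris(hive_site_path):
    """Return the hive.metastore.uris value from a Hadoop-style XML config, or None."""
    root = ET.parse(hive_site_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            return prop.findtext("value")
    return None

# Illustrative config; a real cluster's hive-site.xml lives in the Hive conf directory.
sample = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://namenode:9083</value>
  </property>
</configuration>"""

path = os.path.join(tempfile.gettempdir(), "hive-site-sample.xml")
with open(path, "w") as f:
    f.write(sample)

print(metastore_uris(path))  # prints: thrift://namenode:9083
```

If the function returns `None` against the real `hive-site.xml`, the Metastore prerequisite is not met.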
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85365959

--- Diff: pxf/HivePXF.html.md.erb ---
+- location - text
+- month - text
+- number\_of\_orders - integer
+- total\_sales - double
--- End diff --

Also consider term/definition table here.
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85366470

--- Diff: pxf/HivePXF.html.md.erb ---
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367789

--- Diff: pxf/HivePXF.html.md.erb ---
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85372086

--- Diff: pxf/HivePXF.html.md.erb ---
GitHub user lisakowen opened a pull request:

https://github.com/apache/incubator-hawq-docs/pull/39

HAWQ-1071 - add examples for HiveText and HiveRC plugins

added examples, restructured content, added hive command line section.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lisakowen/incubator-hawq-docs feature/pxfhive-enhance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hawq-docs/pull/39.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #39

commit 0398a62fefd3627273927f938b4d082a25bf3003
Author: Lisa Owen
Date: 2016-09-26T21:37:04Z

    restructure PXF Hive pulug-in page; add more relevant examples

commit 457d703a3f5c057e241acf985fbc35da34f6a075
Author: Lisa Owen
Date: 2016-09-26T22:40:10Z

    PXF Hive plug-in mods

commit 822d7545e746490e55507866c62dca5ea2d5349a
Author: Lisa Owen
Date: 2016-10-03T22:19:03Z

    clean up some extra whitespace

commit 8c986b60b8db3edd77c10f23704cc9174c52a803
Author: Lisa Owen
Date: 2016-10-11T18:37:34Z

    include list of hive profile names in file format section

commit 150fa67857871d58ea05eb14c023215c932ab7b1
Author: Lisa Owen
Date: 2016-10-11T19:03:39Z

    link to CREATE EXTERNAL TABLE ref page

commit 5cdd8f8c35a51360fe3bfdedeff796bf1e0f31f3
Author: Lisa Owen
Date: 2016-10-11T20:27:17Z

    sql commands all caps

commit 67e8b9699c9eec64d04ce9e6048ffb385f7f3573
Author: Lisa Owen
Date: 2016-10-11T20:33:35Z

    use <> for optional args

commit 54b2c01a80d477cc093d7eb1ed2aa8c0bf762d36
Author: Lisa Owen
Date: 2016-10-22T00:16:24Z

    fix some duplicate ids

commit 284c3ec2db38e8d9020826e3bf292efad76c1819
Author: Lisa Owen
Date: 2016-10-26T15:38:37Z

    restructure to use numbered steps

commit 2a38a0322abda804cfd4fc8aa39f142f0d83ea11
Author: Lisa Owen
Date: 2016-10-26T17:20:28Z

    note/notice