[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-hawq-docs/pull/39


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread lisakowen
Github user lisakowen commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85424776
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
   delivery_state TEXT, 
   delivery_city TEXT
 )
-LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
+LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
 FORMAT 'custom' (FORMATTER='pxfwritable_import');
 
 postgres=# SELECT * FROM pxf_sales_part;
 ```
 
-### Example
+### Query Without Pushdown
 
 In the following example, the HAWQ query filters on the `delivery_city` 
partition value `Sacramento`. The filter on `item_name` is not pushed down, 
since it is not a partition column; it is applied on the HAWQ side after all 
the data in the `Sacramento` partition is transferred for processing.
 
 ``` sql
-postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' 
AND item_name = 'shirt';
+postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' 
AND item_name = 'cube';
 ```
 
-### Example
+### Query With Pushdown
--- End diff --

will also need to add this GUC to the documentation. 




[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread lisakowen
Github user lisakowen commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85408122
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@
 title: Accessing Hive Data
 ---
 
-This topic describes how to access Hive data using PXF. You have several 
options for querying data stored in Hive. You can create external tables in PXF 
and then query those tables, or you can easily query Hive tables by using HAWQ 
and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored 
in HCatalog.
+Apache Hive is a distributed data warehousing infrastructure. Hive 
facilitates managing large data sets and supports multiple data formats, 
including comma-separated value (.csv), RC, ORC, and Parquet. The PXF Hive 
plug-in reads data stored in Hive, as well as in HDFS or HBase.
+
+This section describes how to use PXF to access Hive data. Options for 
querying data stored in Hive include:
+
+-  Creating an external table in PXF and querying that table
+-  Querying Hive tables via PXF's integration with HCatalog
 
 ## Prerequisites
 
-Check the following before using PXF to access Hive:
+Before accessing Hive data with HAWQ and PXF, ensure that:
 
--   The PXF HDFS plug-in is installed on all cluster nodes.
+-   The PXF HDFS plug-in is installed on all cluster nodes. See 
[Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation 
information.
 -   The PXF Hive plug-in is installed on all cluster nodes.
 -   The Hive JAR files and conf directory are installed on all cluster 
nodes.
--   Test PXF on HDFS before connecting to Hive or HBase.
+-   You have tested PXF on HDFS.
 -   You are running the Hive Metastore service on a machine in your 
cluster. 
 -   You have set the `hive.metastore.uris` property in the 
`hive-site.xml` on the NameNode.
 
+## Hive File Formats
+
+Hive supports several file formats:
+
+-   TextFile - flat file with data in comma-, tab-, or space-separated 
value format or JSON notation
+-   SequenceFile - flat file consisting of binary key/value pairs
+-   RCFile - record columnar data consisting of binary key/value pairs; 
high row compression rate
+-   ORCFile - optimized row columnar data with stripe, footer, and 
postscript sections; reduces data size
+-   Parquet - compressed columnar data representation
+-   Avro - JSON-defined, schema-based data serialization format
+
+Refer to [File 
Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for 
detailed information about the file formats supported by Hive.
+
+The PXF Hive plug-in provides the following profiles for accessing these 
Hive file formats (see the sketch after this list):
+
+- `Hive`
+- `HiveText`
+- `HiveRC`
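
A sketch of the kind of profile-specific external table this PR adds, modeled 
on the `Profile=Hive` example later in this thread; the `HiveText` `DELIMITER` 
option and the `default.sales_info` table name are assumptions:

``` sql
-- Sketch: the profile is named in the LOCATION URI of the PXF external table.
-- Host and port (namenode:51200) follow the Hive-profile example in this PR.
CREATE EXTERNAL TABLE salesinfo_hivetext(
  location TEXT,
  month TEXT,
  number_of_orders INT,
  total_sales FLOAT8
)
LOCATION ('pxf://namenode:51200/default.sales_info?Profile=HiveText&DELIMITER=\x2c')
FORMAT 'TEXT' (delimiter=E',');
```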
+
+## Data Type Mapping
+
+### Primitive Data Types
+
+To represent Hive data in HAWQ, map data values that use a primitive data 
type to HAWQ columns of the same type.
+
+The following table summarizes external mapping rules for Hive primitive 
types.
+
+| Hive Data Type  | HAWQ Data Type |
+|---|---|
+| boolean| bool |
+| int   | int4 |
+| smallint   | int2 |
+| tinyint   | int2 |
+| bigint   | int8 |
+| decimal  |  numeric  |
+| float   | float4 |
+| double   | float8 |
+| string   | text |
+| binary   | bytea |
+| char   | bpchar |
+| varchar   | varchar |
+| timestamp   | timestamp |
+| date   | date |
+
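A brief sketch applying these mapping rules (hypothetical table and column 
names; the LOCATION clause follows the `Profile=Hive` example later in this 
thread):

``` sql
-- Hypothetical: Hive columns (id bigint, price decimal, sold timestamp)
-- mapped to HAWQ column types per the table above.
CREATE EXTERNAL TABLE hive_mapping_sketch(
  id    INT8,      -- Hive bigint
  price NUMERIC,   -- Hive decimal
  sold  TIMESTAMP  -- Hive timestamp
)
LOCATION ('pxf://namenode:51200/default.example?Profile=Hive')
FORMAT 'custom' (FORMATTER='pxfwritable_import');
```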
+
+### Complex Data Types
+
+Hive supports complex data types including array, struct, map, and union. 
PXF maps each of these complex types to `text`.  While HAWQ does not natively 
support these types, you can create HAWQ functions or application code to 
extract subcomponents of these complex data types.
+
+An example using complex data types is provided later in this topic.
+
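A minimal sketch of such extraction, assuming the complex value is surfaced 
as a bracketed, comma-separated string (the serialization and the table and 
column names are assumptions):

``` sql
-- Hypothetical: t1 is a Hive array column surfaced by PXF as text, e.g. '[1,2,3]'.
-- btrim strips the brackets; string_to_array splits on the comma.
SELECT (string_to_array(btrim(t1, '[]'), ','))[1] AS first_element
FROM   complex_types_sketch;
```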
+
+## Sample Data Set
+
+The examples in this topic operate on a common data set. This simple data 
set models a retail sales operation and includes fields with the following 
names and data types:
+
+- location - text
+- month - text
+- number\_of\_orders - integer
+- total\_sales - double
+
+Prepare the sample data set for use:
+
+1. Create a text file:
+
+```
+$ vi /tmp/pxf_hive_datafile.txt
+```
+
+2. Add the following data to `pxf_hive_datafile.txt`; notice the use of 
the comma `,` to separate the four field values:
+
+```
+Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67
+San Francisco,Sept,156,6846.34
+Paris,Nov,159,7134.56
+San Francisco,Jan,113,5397.89
+Prague,Dec,333,9894.77
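+```

A plausible next step in this preparation, sketched in HiveQL; the 
`sales_info` table name follows the HCatalog example later in this thread, 
and the exact DDL is an assumption:

``` sql
-- HiveQL sketch: create a comma-delimited text table and load the sample file.
CREATE TABLE sales_info (
  location STRING,
  month STRING,
  number_of_orders INT,
  total_sales DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;

LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt' INTO TABLE sales_info;
```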
 

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread lisakowen
Github user lisakowen commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85407700
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@
 title: Accessing Hive Data
 ---
 
-This topic describes how to access Hive data using PXF. You have several 
options for querying data stored in Hive. You can create external tables in PXF 
and then query those tables, or you can easily query Hive tables by using HAWQ 
and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored 
in HCatalog.
+Apache Hive is a distributed data warehousing infrastructure.  Hive 
facilitates managing large data sets supporting multiple data formats, 
including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive 
plug-in reads data stored in Hive, as well as HDFS or HBase.
+
+This section describes how to use PXF to access Hive data. Options for 
querying data stored in Hive include:
+
+-  Creating an external table in PXF and querying that table
+-  Querying Hive tables via PXF's integration with HCatalog
 
 ## Prerequisites
 
-Check the following before using PXF to access Hive:
+Before accessing Hive data with HAWQ and PXF, ensure that:
 
--   The PXF HDFS plug-in is installed on all cluster nodes.
+-   The PXF HDFS plug-in is installed on all cluster nodes. See 
[Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation 
information.
 -   The PXF Hive plug-in is installed on all cluster nodes.
 -   The Hive JAR files and conf directory are installed on all cluster 
nodes.
--   Test PXF on HDFS before connecting to Hive or HBase.
+-   You have tested PXF on HDFS.
 -   You are running the Hive Metastore service on a machine in your 
cluster. 
 -   You have set the `hive.metastore.uris` property in the 
`hive-site.xml` on the NameNode.
 
+## Hive File Formats
+
+Hive supports several file formats:
+
+-   TextFile - flat file with data in comma-, tab-, or space-separated 
value format or JSON notation
+-   SequenceFile - flat file consisting of binary key/value pairs
+-   RCFile - record columnar data consisting of binary key/value pairs; 
high row compression rate
+-   ORCFile - optimized row columnar data with stripe, footer, and 
postscript sections; reduces data size
+-   Parquet - compressed columnar data representation
+-   Avro - JSON-defined, schema-based data serialization format
+
+Refer to [File 
Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for 
detailed information about the file formats supported by Hive.
+
+The PXF Hive plug-in supports the following profiles for accessing the 
Hive file formats listed above. These include:
+
+- `Hive`
+- `HiveText`
+- `HiveRC`
+
+## Data Type Mapping
+
+### Primitive Data Types
+
+To represent Hive data in HAWQ, map data values that use a primitive data 
type to HAWQ columns of the same type.
+
+The following table summarizes external mapping rules for Hive primitive 
types.
+
+| Hive Data Type  | Hawq Data Type |
+|---|---|
+| boolean| bool |
+| int   | int4 |
+| smallint   | int2 |
+| tinyint   | int2 |
+| bigint   | int8 |
+| decimal  |  numeric  |
+| float   | float4 |
+| double   | float8 |
+| string   | text |
+| binary   | bytea |
+| char   | bpchar |
+| varchar   | varchar |
+| timestamp   | timestamp |
+| date   | date |
+
+
+### Complex Data Types
+
+Hive supports complex data types including array, struct, map, and union. 
PXF maps each of these complex types to `text`.  While HAWQ does not natively 
support these types, you can create HAWQ functions or application code to 
extract subcomponents of these complex data types.
+
+An example using complex data types is provided later in this topic.
+
+
+## Sample Data Set
+
+Examples used in this topic will operate on a common data set. This simple 
data set models a retail sales operation and includes fields with the following 
names and data types:
+
+- location - text
+- month - text
+- number\_of\_orders - integer
+- total\_sales - double
+
+Prepare the sample data set for use:
+
+1. First, create a text file:
+
+```
+$ vi /tmp/pxf_hive_datafile.txt
+```
+
+2. Add the following data to `pxf_hive_datafile.txt`; notice the use of 
the comma `,` to separate the four field values:
+
+```
+Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67
+San Francisco,Sept,156,6846.34
+Paris,Nov,159,7134.56
+San Francisco,Jan,113,5397.89
+Prague,Dec,333,9894.77
 

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread lisakowen
Github user lisakowen commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85407620
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
-### Example
+### Query With Pushdown
--- End diff --

yes, this is good info to share with the user!  i checked out the code, and 
it looks like this GUC is on by default.  i will add some text to that effect 
in the appropriate section.




[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread lisakowen
Github user lisakowen commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85403929
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread kavinderd
Github user kavinderd commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85378363
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
-### Example
+### Query With Pushdown
--- End diff --

Somewhere it should be stated that the HAWQ GUC 
`pxf_enable_filter_pushdown` needs to be turned on. If this is off no filter 
pushdown will occur regardless of the nature of the query.
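
A minimal sketch of checking and enabling that setting in a HAWQ session 
(syntax assumes standard PostgreSQL-style GUC handling in HAWQ):

``` sql
-- Verify whether PXF filter pushdown is enabled for the session.
SHOW pxf_enable_filter_pushdown;

-- Enable it; when off, no filter is pushed down regardless of the query.
SET pxf_enable_filter_pushdown = on;
```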




[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread kavinderd
Github user kavinderd commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85376355
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread kavinderd
Github user kavinderd commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85376289
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367943
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85371576
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -151,184 +477,120 @@ To enable HCatalog query integration in HAWQ, 
perform the following steps:
 postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
 ``` 
 
-3.  To query a Hive table with HCatalog integration, simply query HCatalog 
directly from HAWQ. The query syntax is:
 
-``` sql
-postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
-```
+To query a Hive table with HCatalog integration, query HCatalog directly 
from HAWQ. The query syntax is:
+
+``` sql
+postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
+```
 
-For example:
+For example:
 
-``` sql
-postgres=# SELECT * FROM hcatalog.default.sales;
-```
-
-4.  To obtain a description of a Hive table with HCatalog integration, you 
can use the `psql` client interface.
--   Within HAWQ, use either the `\d
 hcatalog.hive-db-name.hive-table-name` or `\d+ 
hcatalog.hive-db-name.hive-table-name` commands to describe a 
single table. For example, from the `psql` client interface:
-
-``` shell
-$ psql -d postgres
-postgres=# \d hcatalog.default.test
-
-PXF Hive Table "default.test"
-Column|  Type  
---+
- name | text
- type | text
- supplier_key | int4
- full_price   | float8 
-```
--   Use `\d hcatalog.hive-db-name.*` to describe the whole database 
schema. For example:
-
-``` shell
-postgres=# \d hcatalog.default.*
-
-PXF Hive Table "default.test"
-Column|  Type  
---+
- type | text
- name | text
- supplier_key | int4
- full_price   | float8
-
-PXF Hive Table "default.testabc"
- Column | Type 
-+--
- type   | text
- name   | text
-```
--   Use `\d hcatalog.*.*` to describe the whole schema:
-
-``` shell
-postgres=# \d hcatalog.*.*
-
-PXF Hive Table "default.test"
-Column|  Type  
---+
- type | text
- name | text
- supplier_key | int4
- full_price   | float8
-
-PXF Hive Table "default.testabc"
- Column | Type 
-+--
- type   | text
- name   | text
-
-PXF Hive Table "userdb.test"
-  Column  | Type 
---+--
- address  | text
- username | text
- 
-```
-
-**Note:** When using `\d` or `\d+` commands in the `psql` HAWQ client, 
`hcatalog` will not be listed as a database. If you use other `psql` compatible 
clients, `hcatalog` will be listed as a database with a size value of `-1` 
since `hcatalog` is not a real database in HAWQ.
-
-5.  Alternatively, you can use the **pxf\_get\_item\_fields** user-defined 
function (UDF) to obtain Hive table descriptions from other client interfaces 
or third-party applications. The UDF takes a PXF profile and a table pattern 
string as its input parameters.
-
-**Note:** Currently the only supported input profile is `'Hive'`.
-
-For example, the following statement returns a description of a 
specific table. The description includes path, itemname (table), fieldname, and 
fieldtype.
+``` sql
+postgres=# SELECT * FROM hcatalog.default.sales_info;
+```
+
+To obtain a description of a Hive table with HCatalog integration, you can 
use the `psql` client interface.
+
+-   Within HAWQ, use either the `\d
 hcatalog.hive-db-name.hive-table-name` or `\d+ 
hcatalog.hive-db-name.hive-table-name` commands to describe a single 
table. For example, from the `psql` client interface:
+
+``` shell
+$ psql -d postgres
+```
 
 ``` sql
-postgres=# select * from pxf_get_item_fields('Hive','default.test');
+postgres=# \d hcatalog.default.sales_info_rcfile;
 ```
-
-``` pre
-  path   | itemname |  fieldname   | fieldtype 
--+--+--+---
- default | test | name | text
- default | test | type | text
- default | test | supplier_key | int4
  

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85370681
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -151,184 +477,120 @@ To enable HCatalog query integration in HAWQ, 
perform the following steps:
 postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
 ``` 
 
-3.  To query a Hive table with HCatalog integration, simply query HCatalog 
directly from HAWQ. The query syntax is:
 
-``` sql
-postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
-```
+To query a Hive table with HCatalog integration, query HCatalog directly 
from HAWQ. The query syntax is:
--- End diff --

It's a bit awkward to drop out of the procedure and into free-form 
discussion of the various operations.  I think it might be better to put the 
previous 3-step procedure into a new subsection like "Enabling HCatalog 
Integration" and then putting the remaining non-procedural content into "Usage" 
?




[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85368752
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@
[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85365959
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@
+## Sample Data Set
+
+The examples in this topic operate on a common data set. This simple data 
set models a retail sales operation and includes fields with the following 
names and data types:
+
+- location - text
+- month - text
+- number\_of\_orders - integer
+- total\_sales - double
--- End diff --

Also consider term/definition table here.




[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85366470
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367789
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-27 Thread dyozie
Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85372086
  
--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

2016-10-26 Thread lisakowen
GitHub user lisakowen opened a pull request:

https://github.com/apache/incubator-hawq-docs/pull/39

HAWQ-1071 - add examples for HiveText and HiveRC plugins

Added examples, restructured content, and added a Hive command line section.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lisakowen/incubator-hawq-docs 
feature/pxfhive-enhance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hawq-docs/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #39


commit 0398a62fefd3627273927f938b4d082a25bf3003
Author: Lisa Owen 
Date:   2016-09-26T21:37:04Z

restructure PXF Hive plug-in page; add more relevant examples

commit 457d703a3f5c057e241acf985fbc35da34f6a075
Author: Lisa Owen 
Date:   2016-09-26T22:40:10Z

PXF Hive plug-in mods

commit 822d7545e746490e55507866c62dca5ea2d5349a
Author: Lisa Owen 
Date:   2016-10-03T22:19:03Z

clean up some extra whitespace

commit 8c986b60b8db3edd77c10f23704cc9174c52a803
Author: Lisa Owen 
Date:   2016-10-11T18:37:34Z

include list of hive profile names in file format section

commit 150fa67857871d58ea05eb14c023215c932ab7b1
Author: Lisa Owen 
Date:   2016-10-11T19:03:39Z

link to CREATE EXTERNAL TABLE ref page

commit 5cdd8f8c35a51360fe3bfdedeff796bf1e0f31f3
Author: Lisa Owen 
Date:   2016-10-11T20:27:17Z

sql commands all caps

commit 67e8b9699c9eec64d04ce9e6048ffb385f7f3573
Author: Lisa Owen 
Date:   2016-10-11T20:33:35Z

use <> for optional args

commit 54b2c01a80d477cc093d7eb1ed2aa8c0bf762d36
Author: Lisa Owen 
Date:   2016-10-22T00:16:24Z

fix some duplicate ids

commit 284c3ec2db38e8d9020826e3bf292efad76c1819
Author: Lisa Owen 
Date:   2016-10-26T15:38:37Z

restructure to use numbered steps

commit 2a38a0322abda804cfd4fc8aa39f142f0d83ea11
Author: Lisa Owen 
Date:   2016-10-26T17:20:28Z

note/notice



