[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

dyozie Thu, 27 Oct 2016 09:12:30 -0700

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85368752
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several 
options for querying data stored in Hive. You can create external tables in PXF 
and then query those tables, or you can easily query Hive tables by using HAWQ 
and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored 
in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive 
facilitates managing large data sets and supports multiple data formats, 
including comma-separated value (.csv), RC, ORC, and Parquet. The PXF Hive 
plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for 
querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See 
[Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation 
information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory are installed on all cluster 
nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your 
cluster.
     -   You have set the `hive.metastore.uris` property in the 
`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated 
value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; 
high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and 
postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File 
Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for 
detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the 
Hive file formats listed above:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data 
type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive 
types.
    +
    +| Hive Data Type  | HAWQ Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. 
PXF maps each of these complex types to `text`.  While HAWQ does not natively 
support these types, you can create HAWQ functions or application code to 
extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple 
data set models a retail sales operation and includes fields with the following 
names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of 
the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later 
exercises.
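
As a non-interactive alternative to editing the file in `vi`, the steps above can be sketched as a short shell script. This is only a sketch: it writes the same ten records to the same `/tmp` path, and the field-count check at the end is an added sanity test, not part of the documented procedure.

```shell
#!/bin/sh
# Write the ten comma-separated sample records without opening an editor.
# The path matches the one used in the steps above.
cat > /tmp/pxf_hive_datafile.txt <<'EOF'
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
San Francisco,Sept,156,6846.34
Paris,Nov,159,7134.56
San Francisco,Jan,113,5397.89
Prague,Dec,333,9894.77
Bangalore,Jul,271,8320.55
Beijing,Dec,100,4248.41
EOF

# Added sanity check: count all records and those without exactly four fields.
records=$(awk 'END { print NR }' /tmp/pxf_hive_datafile.txt)
bad=$(awk -F',' 'NF != 4 { n++ } END { print n + 0 }' /tmp/pxf_hive_datafile.txt)
echo "records: $records, malformed: $bad"
```

A nonzero malformed count would point to a stray or missing comma, which would otherwise surface later as NULL or shifted columns in Hive.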
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is an interactive subsystem similar to `psql`. To 
start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example of how to create and load data into a sample Hive 
table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, 
full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Table
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the 
underlying file storage format. The PXF Hive plug-ins automatically detect 
source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive 
plug-in uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the 
table in Textfile (the default) format.  Hive Textfile format supports comma-, 
tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field 
delimiter within a data record (line). The `sales_info` table field delimiter 
is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` 
table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded 
successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following PXF table definition is valid for any Hive file storage 
type.
    +In examples later in this section, you will access the `sales_info` Hive 
table directly via PXF. You will also insert `sales_info` data into tables of 
other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include 
`Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive 
data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf 
parameters>[&custom-option=value...]')FORMAT 'CUSTOM' 
(formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>]')
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL 
TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   
FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:\<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to 
the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values 
`Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the 
`HiveText` and `HiveRC` profiles and identifies the field delimiter used in the 
Hive data set.  \<delim\> must be a single ASCII character or specified in 
hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. 
The `CUSTOM` format supports only the built-in `pxfwritable_import` 
`formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must 
specify `TEXT`. Specify the field delimiter a second time here, as 
`delimiter='<delim>'`. |
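
The keyword rules in the table above can be sketched as a small shell function that assembles the `LOCATION` URI. `pxf_hive_location` is a hypothetical helper written for illustration, not a HAWQ or PXF utility. It assumes an explicit port and only checks the one rule stated in the table: `HiveText` and `HiveRC` need a `DELIMITER`.

```shell
#!/bin/sh
# Hypothetical helper: assemble the pxf:// URI for the LOCATION clause.
# Usage: pxf_hive_location <host> <port> <hive-db> <hive-table> <profile> [<delim>]
pxf_hive_location() {
    host=$1 port=$2 db=$3 table=$4 profile=$5 delim=${6:-}
    case "$profile" in
        Hive)
            printf 'pxf://%s:%s/%s.%s?PROFILE=%s\n' \
                "$host" "$port" "$db" "$table" "$profile" ;;
        HiveText|HiveRC)
            # These two profiles require DELIMITER in the LOCATION URI.
            if [ -z "$delim" ]; then
                echo "profile $profile requires a delimiter" >&2
                return 1
            fi
            printf 'pxf://%s:%s/%s.%s?PROFILE=%s&DELIMITER=%s\n' \
                "$host" "$port" "$db" "$table" "$profile" "$delim" ;;
        *)
            echo "unknown profile: $profile" >&2
            return 1 ;;
    esac
}

# Prints the URI for the sales_info table with the HiveText profile.
pxf_hive_location namenode 51200 default sales_info HiveText '\x2c'
```

The delimiter still has to be repeated in the `FORMAT` clause for the `HiveText` and `HiveRC` profiles; the helper covers only the `LOCATION` half.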
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` 
database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port 
is omitted, PXF assumes that High Availability (HA) is enabled and connects to 
the HA name service port, 51200 by default. The HA name service port can be 
changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the 
Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing 
a Hive table named `accessories` in the `inventory` Hive database using the PXF 
Hive profile:
    +1. Create the external table:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, 
month text, num_orders int, total_sales float8)
    +                LOCATION 
('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile 
is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a 
delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table 
from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location 
text, month text, num_orders int, total_sales float8)
    +                 LOCATION 
('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string 
literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in 
hexadecimal format. `\x` is a prefix that instructs PXF to interpret the 
following characters as hexadecimal. `2c` is the hex value for the comma 
character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single 
ASCII comma character ','. The `E` prefix marks the value as an escape string 
constant.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE 
location='Beijing';
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
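
The HiveText example above writes the comma delimiter as `\x2c` in the `LOCATION` URI. For other single-character delimiters, the hexadecimal form can be worked out in the shell. `delim_to_hex` is a hypothetical helper for illustration; it relies on the standard `printf` behavior of converting an argument that starts with a quote into the numeric code of the next character.

```shell
#!/bin/sh
# Hypothetical helper: print the \x<hex> form of a single ASCII delimiter.
# A leading single quote in a printf argument makes printf use the numeric
# code of the character that follows it.
delim_to_hex() {
    printf '\\x%02x\n' "'$1"
}

delim_to_hex ','    # comma, as used in the example above: \x2c
delim_to_hex '|'    # pipe: \x7c
```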
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The 
`HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC 
Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. 
The `HiveRC` profile is more performant than the `Hive` profile for this file 
format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in 
`sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was 
loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table 
from the Hive `sales_info_rcfile` table created in the previous step. When 
using the `HiveRC` profile, you **must** specify a delimiter option in *both* 
the `LOCATION` and `FORMAT` clauses:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location 
text, month text, num_orders int, total_sales float8)
    +                 LOCATION 
('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a 
string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive 
tables that use the Parquet storage format in HDFS. Simply map the table 
columns using equivalent HAWQ data types. For example, if a Hive table is 
created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid 
int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname 
text, custid int, acctbalance double precision)
    +    LOCATION 
('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The 
HiveRC and HiveText profiles are faster than the generic Hive profile. When 
using the HiveRC and HiveText profiles, you must specify a DELIMITER option in 
the LOCATION clause. See [Using Profiles to Read and Write 
Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on 
profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` 
profile.
     
    +?? MORE HERE. ??
    --- End diff --
    
    Need to remove this comment.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
