Github user dyozie commented on a diff in the pull request:
https://github.com/apache/incubator-hawq-docs/pull/17#discussion_r81397353
--- Diff: reference/cli/admin_utilities/hawqregister.html.md.erb ---
@@ -2,102 +2,83 @@
title: hawq register
---
-Loads and registers external parquet-formatted data in HDFS into a corresponding table in HAWQ.
+Loads and registers
+AO or Parquet-formatted data in HDFS into a corresponding table in HAWQ.
## <a id="topic1__section2"></a>Synopsis
``` pre
-hawq register <databasename> <tablename> <hdfspath>
+Usage 1:
+hawq register [<connection_options>] [-f <hdfsfilepath>] [-e <eof>] <tablename>
+
+Usage 2:
+hawq register [<connection_options>] [-c <configfilepath>] [--force] <tablename>
+
+Connection Options:
[-h <hostname>]
[-p <port>]
[-U <username>]
[-d <database>]
- [-t <tablename>]
+
+Misc. Options:
[-f <filepath>]
+ [-e <eof>]
+ [--force]
[-c <yml_config>]
hawq register help | -?
hawq register --version
```
## <a id="topic1__section3"></a>Prerequisites
-The client machine where `hawq register` is executed must have the following:
+The client machine where `hawq register` is executed must meet the following conditions:
- Network access to and from all hosts in your HAWQ cluster (master and segments) and the hosts where the data to be loaded is located.
+- The Hadoop client must be configured and the hdfs filepath specified.
- The files to be registered and the HAWQ table must be located in the same HDFS cluster.
- The target table DDL is configured with the correct data type mapping.
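A minimal sanity check for the Hadoop-client and HDFS-path prerequisites, assuming the file to be registered lives at `hdfs://localhost:8020/temp/hive.paq` (an illustrative path), might look like:

``` pre
# illustrative path; confirms the Hadoop client can reach the file to be registered
$ hdfs dfs -ls hdfs://localhost:8020/temp/hive.paq
```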
## <a id="topic1__section4"></a>Description
-`hawq register` is a utility that loads and registers existing or external parquet data in HDFS into HAWQ, so that it can be directly ingested and accessed through HAWQ. Parquet data from the file or directory in the specified path is loaded into the appropriate HAWQ table directory in HDFS and the utility updates the corresponding HAWQ metadata for the files.
+`hawq register` is a utility that loads and registers existing data files or folders in HDFS into HAWQ internal tables, allowing HAWQ to directly read the data and use internal table processing for operations such as transactions and high performance, without needing to load or copy it. Data from the file or directory specified by \<hdfsfilepath\> is loaded into the appropriate HAWQ table directory in HDFS and the utility updates the corresponding HAWQ metadata for the files.
-Only parquet tables can be loaded using the `hawq register` command. Metadata for the parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and parquet tables, so the data is mapped. You must verify that the structure of the parquet files and the HAWQ table are compatible before running `hawq register`.
+You can use `hawq register` to:
-Note: only HAWQ or HIVE-generated parquet tables are currently supported.
+- Load and register external Parquet-formatted file data generated by an external system such as Hive or Spark.
+- Recover cluster data from a backup cluster.
-###Limitations for Registering Hive Tables to HAWQ
-The currently-supported data types for generating Hive tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
+Two usage models are available.
-The following HIVE data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.
+###Usage Model 1: register file data to an existing table.
+`hawq register [-h hostname] [-p port] [-U username] [-d databasename] [-f filepath] [-e eof] <tablename>`
-## <a id="topic1__section5"></a>Options
-
-**General Options**
-
-<dt>-? (show help) </dt>
-<dd>Show help, then exit.
-
-<dt>-\\\-version </dt>
-<dd>Show the version of this utility, then exit.</dd>
-
-
-**Connection Options**
-
-<dt>-h \<hostname\> </dt>
-<dd>Specifies the host name of the machine on which the HAWQ master database server is running. If not specified, reads from the environment variable `$PGHOST` or defaults to `localhost`.</dd>
-
-<dt> -p \<port\> </dt>
-<dd>Specifies the TCP port on which the HAWQ master database server is listening for connections. If not specified, reads from the environment variable `$PGPORT` or defaults to 5432.</dd>
+Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so the data is mapped. Refer to the section [Data Type Mapping](hawqregister.html#topic1__section7) below. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`.
-<dt>-U \<username\> </dt>
-<dd>The database role name to connect as. If not specified, reads from the environment variable `$PGUSER` or defaults to the current system user name.</dd>
+####Limitations
+Only HAWQ or Hive-generated Parquet tables are supported.
+Hash tables and partitioned tables are not supported in this usage model.
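A minimal sketch of this usage model, assuming an existing randomly distributed HAWQ table named `parquet_table` in the `postgres` database and a Hive-generated Parquet file at `hdfs://localhost:8020/temp/hive.paq` (both names are illustrative), might look like:

``` pre
# illustrative names and path; verify the destination table's structure first
$ psql -d postgres -c "\d parquet_table"
$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table
```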
-<dt>-d , --database \<databasename\> </dt>
-<dd>The database to register the parquet HDFS data into. The default is `postgres`<dd>
+###Usage Model 2: Use information from a YAML configuration file to register data
-<dt>-t , --tablename \<tablename\> </dt>
-<dd>The HAWQ table that will store the parquet data. The table cannot use hash distribution: only tables using random distribution can be registered into HAWQ.</dd>
-
-<dt>-f , --filepath \<hdfspath\></dt>
-<dd>The path of the file or directory in HDFS containing the files to be registered.</dd>
-
-<dt>-c , --config \<yml_config\> </dt>
-<dd>Registers a YAML-format configuration file into HAWQ.</dd>
-
-
+`hawq register [-h hostname] [-p port] [-U username] [-d databasename] [-c configfile] [--force] <tablename>`
-## <a id="topic1__section6"></a>Examples
+Files generated by the `hawq extract` command are registered through use of metadata in a YAML configuration file. Both AO and Parquet tables can be registered. Tables need not exist in HAWQ before being registered.
-This example shows how to register a HIVE-generated parquet file in HDFS into the table `parquet_table` in HAWQ, which is in the database named `postgres`. The file path of the HIVE-generated file is `hdfs://localhost:8020/temp/hive.paq`.
-
-For the purposes of this example, assume that the location of the database is `hdfs://localhost:8020/hawq_default`, the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and the last file under the filenode is numbered 7.
-
-Enter:
-
-``` pre
-$ hawq register postgres parquet_table hdfs://localhost:8020/temp/hive.paq
-```
+The register process behaves differently depending on the following conditions:
-After running the `hawq register` command for the file location `hdfs://localhost:8020/temp/hive.paq`, the corresponding new location of the file in HDFS is: `hdfs://localhost:8020/hawq_default/16385/16387/77160/8`. The command then updates the metadata of the table `parquet_table` in HAWQ, which is contained in the table `pg_aoseg.pg_paqseg_77160`. The pg\_aoseg is a fixed schema for row-oriented and parquet ao tables. For row-oriented tables, table name prefix is pg\_aoseg. The table name prefix for parquet tables is pg\_paqseg. 77160 is the relation id of the table.
+- Existing tables have files appended to the existing HAWQ table.
+- If a table does not exist, it is created and registered into HAWQ.
+- If the -\-force option is used, the data in existing catalog tables is erased and re-registered.
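A minimal sketch of this usage model, assuming a YAML file `my_table.yml` previously produced by `hawq extract` and a target table named `my_table` (both names are illustrative), might look like:

``` pre
# illustrative names; the YAML file comes from a prior hawq extract run
$ hawq register -d postgres -c my_table.yml my_table
# with --force, data in existing catalog tables is erased and re-registered
$ hawq register -d postgres -c my_table.yml --force my_table
```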
-To locate the table, you can either find the relation ID by looking up the catalog table pg\_class by running `select oid from pg_class where relname=$relname` or by finding the table name by using the command `select segrelid from pg_appendonly where relid = $relid` then running `select relname from pg_class where oid = segrelid`.
+###Limitations for Registering Hive Tables to HAWQ
+The currently-supported data types for generating Hive tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
-**Recommendation:** Before running ```hawq register```, create a copy of the parquet file to be registered, then run ```hawq register``` on the copy. This leaves the original file available for additional Hive queries or if a data mapping error is encountered.
+The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.
-##Data Type Mapping<a id="topic1__section7"></a>
+###Data Type Mapping<a id="topic1__section7"></a>
-HAWQ and parquet tables and HIVE and HAWQ tables use different data types. Mapping must be used for compatibility. You are responsible for making sure your implementation is mapped to the appropriate data type before running `hawq register`. The tables below show equivalent data types, if available.
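As an illustrative sketch of preparing a compatible destination table before registering, assuming a source file with an int, a string, and a double column (the column names and the int4/varchar/float8 mappings shown are assumptions to be checked against the mapping tables below), one might create the table with random distribution, since hash-distributed tables cannot be registered:

``` pre
# illustrative DDL; confirm the type mappings against the tables below
$ psql -d postgres -c "CREATE TABLE parquet_table (id int4, name varchar, amount float8) WITH (appendonly=true, orientation=parquet) DISTRIBUTED RANDOMLY;"
```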
--- End diff ---
See previous edit.