Github user dyozie commented on a diff in the pull request:
https://github.com/apache/incubator-hawq-docs/pull/17#discussion_r81390790
--- Diff: datamgmt/load/g-register_files.html.md.erb ---
@@ -0,0 +1,213 @@
+---
+title: Registering Files into HAWQ Internal Tables
+---
+
+The `hawq register` utility loads and registers HDFS data files or folders into HAWQ internal tables. Files can be read directly, rather than having to be copied or loaded, resulting in higher performance and more efficient transaction processing.
+
+Data from the file or directory specified by \<hdfsfilepath\> is loaded into the appropriate HAWQ table directory in HDFS, and the utility updates the corresponding HAWQ metadata for the files. Files in either AO or Parquet format in HDFS can be loaded into a corresponding table in HAWQ.
+
+You can use `hawq register` either to:
+
+- Load and register external Parquet-formatted file data generated by an external system such as Hive or Spark.
+- Recover cluster data from a backup cluster for disaster recovery.
+
+Requirements for running `hawq register` on the client server are:
+
+- Network access to and from all hosts in your HAWQ cluster (master and segments) and the hosts where the data to be loaded is located.
+- The Hadoop client configured and the HDFS file path specified.
+- The files to be registered and the HAWQ table located in the same HDFS cluster.
+- The target table DDL configured with the correct data type mapping.
+
+##Registering Externally Generated HDFS File Data to an Existing Table<a id="topic1__section2"></a>
+
+Files or folders in HDFS can be registered into an existing table, allowing them to be managed as a HAWQ internal table. When registering files, you can optionally specify the maximum amount of data to be loaded, in bytes, using the `--eof` option. If registering a folder, the actual file sizes are used.
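+
+For example, a registration that caps the amount of data read from a file might look like the following; the file path and byte count here are hypothetical, for illustration only.
+
+``` pre
+$ hawq register -d postgres --eof 4096 -f hdfs://localhost:8020/temp/sample.paq parquet_table
+```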
+
+Only HAWQ or Hive-generated Parquet tables are supported. Partitioned tables are not supported; attempting to register a partitioned table results in an error.
+
+Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so data must be mapped. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`.
+
+We recommend creating a copy of the Parquet file to be registered before running `hawq register`. You can then run `hawq register` on the copy, leaving the original file available for additional Hive queries or in case a data mapping error is encountered.
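+
+A minimal sketch of that precaution, assuming the hive.paq file used in the example below (the copy's name is illustrative):
+
+``` pre
+$ hdfs dfs -cp /temp/hive.paq /temp/hive_copy.paq
+$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive_copy.paq parquet_table
+```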
+
+###Limitations for Registering Hive Tables to HAWQ
+The data types currently supported for registering Hive tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
+
+The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.
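+
+As an illustration, a Hive table that stays within the supported types might be declared as follows; the table and column names are hypothetical.
+
+```
+CREATE TABLE sales_data (id INT, amount DOUBLE, label STRING)
+STORED AS PARQUET;
+```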
+
+###Example: Registering a Hive-Generated Parquet File
+
+This example shows how to register a Hive-generated Parquet file in HDFS into the table `parquet_table` in HAWQ, which is in the database named `postgres`. The file path of the Hive-generated file is `hdfs://localhost:8020/temp/hive.paq`.
+
+In this example, the location of the database is `hdfs://localhost:8020/hawq_default`, the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and the last file under the filenode is numbered 7.
+
+Enter:
+
+``` pre
+$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table
+```
+
+After running the `hawq register` command for the file location `hdfs://localhost:8020/temp/hive.paq`, the corresponding new location of the file in HDFS is: `hdfs://localhost:8020/hawq_default/16385/16387/77160/8`.
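+
+You can confirm the file's new placement directly in HDFS, for example with:
+
+``` pre
+$ hdfs dfs -ls hdfs://localhost:8020/hawq_default/16385/16387/77160
+```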
+
+The command then updates the metadata of the table `parquet_table` in HAWQ, which is contained in the table `pg_aoseg.pg_paqseg_77160`. pg\_aoseg is a fixed schema for row-oriented and Parquet AO tables. For row-oriented tables, the table name prefix is pg\_aoseg; for Parquet tables, the prefix is pg\_paqseg. 77160 is the relation id of the table.
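+
+To inspect that metadata for this example, you could query the segment table directly (a sketch; the column layout varies by HAWQ version):
+
+```
+select * from pg_aoseg.pg_paqseg_77160;
+```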
+
+To locate the table, you can either find the relation ID by querying the catalog table pg\_class:
+
+```
+select oid from pg_class where relname = $relname;
+```
+
+or find the segment table name by running:
+
+```
+select segrelid from pg_appendonly where relid = $relid;
+```
+
+and then:
+
+```
+select relname from pg_class where oid = segrelid;
+```
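+
+For the example above, substituting concrete values (the relation id 77160 and segment table name come from this example):
+
+```
+select oid from pg_class where relname = 'parquet_table';  -- returns 77160
+select relname from pg_class
+  where oid = (select segrelid from pg_appendonly where relid = 77160);
+  -- returns pg_paqseg_77160
+```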
+
+##Registering Data Using Information from a YAML Configuration File<a id="topic1__section3"></a>
+
+The `hawq register` command can register HDFS files by using metadata loaded from a YAML configuration file, specified with the `--config \<yaml_config\>` option. Both AO and Parquet tables can be registered. Tables need not exist in HAWQ before being registered. This function can be useful in disaster recovery, allowing information created by the `hawq extract` command to be used to re-create HAWQ tables.
+
+You can also use a YAML configuration file to append HDFS files to an existing HAWQ table, or to create a table and register it into HAWQ.
+
+For disaster recovery, tables can be re-registered using the HDFS files and a YAML file. The clusters are assumed to have data periodically imported from Cluster A to Cluster B.
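+
+A minimal sketch of that flow, assuming a table named paq1 (the name and YAML file are illustrative; both utilities are shown in the example later in this topic):
+
+``` pre
+$ hawq extract -o paq1.yml paq1           # on Cluster A: capture table metadata
+$ hawq register --config paq1.yml paq1    # on Cluster B: re-register the table
+```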
+
+Data is registered according to the following conditions:
+
+- Existing tables have files appended to the existing HAWQ table.
+- If a table does not exist, it is created and registered into HAWQ. The catalog table is updated with the file size specified by the YAML file.
+- If the `--force` option is used, the data in existing catalog tables is erased and re-registered. All HDFS-related catalog contents in `pg_aoseg.pg_paqseg_$relid` are cleared. The original files on HDFS are retained (see the sketch after this list).
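+
+A hypothetical forced re-registration, reusing the YAML file and table names from the example later in this topic:
+
+``` pre
+$ hawq register --force --config paq1.yml paq2
+```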
+
+Tables using random distribution are preferred for registering into HAWQ. If hash tables are to be registered, the distribution policy in the YAML file must match that of the table being registered into.
+
+When registering hash tables, the size of the registered file should be identical to or a multiple of the hash table bucket number. When registering hash-distributed tables using a YAML file, the order of the files in the YAML file should reflect the hash distribution.
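+
+The YAML file itself is generated by `hawq extract`; the fragment below is a rough, non-authoritative sketch of the file-listing portion, with key names and values shown for illustration only:
+
+```
+# illustrative sketch only -- generate the real file with `hawq extract`
+Parquet_FileLocations:
+  Files:
+  - path: /hawq_default/16385/16387/77160/1
+    size: 4596
+  - path: /hawq_default/16385/16387/77160/2
+    size: 4596
+Distribution_Policy: DISTRIBUTED BY (a)
+```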
+
+
+###Example: Registration using a YAML Configuration File
+
+This example shows how to use `hawq register` to register HDFS data using a YAML configuration file generated by `hawq extract`.
+
+First, create a table in SQL and insert some data into it.
+
+```
+create table paq1(a int, b varchar(10)) with (appendonly=true, orientation=parquet);
+```
+
+In SQL, run:
+
+```
+insert into paq1 values(generate_series(1,1000), 'abcde');
+```
+
+Extract the table metadata by using the `hawq extract` utility.
+
+```
+hawq extract -o paq1.yml paq1
+```
+
+Register the data into the new table paq2, using the `--config` option to identify the YAML file.
+
+```
+hawq register --config paq1.yml paq2
+```
+In SQL, query the new table to verify that the content has been registered.
+
+```
+select count(*) from paq2;
+```
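+
+The query should return 1000, matching the number of rows inserted into `paq1`.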
+
+
+##Data Type Mapping<a id="topic1__section4"></a>
+
+Hive and Parquet tables use different data types than HAWQ tables, so mapping must be used for metadata compatibility. You are responsible for making sure your implementation is mapped to the appropriate data type before running `hawq register`. The tables below show equivalent data types, if available.
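+
+For instance, a Hive Parquet table declared with columns `(id int, amount double, label string)` might be matched by a HAWQ target table like the following; the table name is hypothetical, and the exact mapping should be confirmed against the tables below.
+
+```
+create table sales_data (id int, amount float8, label varchar)
+with (appendonly=true, orientation=parquet);
+```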
--- End diff --
Reword first sentence: Hive and parquet tables use different data types than HAWQ tables.