more content and rearranging of pxf hdfs plugin page
Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/2da7a92a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/2da7a92a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/2da7a92a

Branch: refs/heads/develop
Commit: 2da7a92a3e8431335a48005d55a70c9eba333e16
Parents: 9ca2779
Author: Lisa Owen <[email protected]>
Authored: Mon Oct 17 16:27:23 2016 -0700
Committer: Lisa Owen <[email protected]>
Committed: Mon Oct 17 16:27:23 2016 -0700

----------------------------------------------------------------------
 pxf/HDFSFileDataPXF.html.md.erb | 331 ++++++++---------------------------
 1 file changed, 76 insertions(+), 255 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/2da7a92a/pxf/HDFSFileDataPXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HDFSFileDataPXF.html.md.erb b/pxf/HDFSFileDataPXF.html.md.erb
index e1c621f..e49688e 100644
--- a/pxf/HDFSFileDataPXF.html.md.erb
+++ b/pxf/HDFSFileDataPXF.html.md.erb
@@ -17,27 +17,27 @@ Before working with HDFS file data using HAWQ and PXF, ensure that:

 The PXF HDFS plug-in supports the following file formats:

-- TextFile - comma-separated value or delimited format plain text file
+- TextFile - comma-separated value (.csv) or delimited format plain text file
 - SequenceFile - flat file consisting of binary key/value pairs
 - Avro - JSON-defined, schema-based data serialization format

 The PXF HDFS plug-in includes the following profiles to support the file formats listed above:

-- `HdfsTextSimple`
-- `HdfsTextMulti`
-- `SequenceWritable`
-- `Avro`
+- `HdfsTextSimple` - text files
+- `HdfsTextMulti` - text files with embedded line feeds
+- `SequenceWritable` - SequenceFile
+- `Avro` - Avro files

 ## <a id="hdfsplugin_datatypemap"></a>Data Type Mapping

 *TODO*: Add data type mapping content.

 ## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands

-HAWQ includes command-line tools that interact directly with HDFS. These tools support typical file system operations including copying, listing, changing file permissions, etc.
+Hadoop includes command-line tools that interact directly with HDFS. These tools support typical file system operations including copying and listing files, changing file permissions, etc.

 The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked with no options, `hdfs dfs` lists the file system options supported by the tool.

-`hdfs dfs` options used in this section are listed in the table below:
+`hdfs dfs` options used in this section are identified in the table below:

 | Option  | Description |
 |-------|-------------------------------------|
@@ -57,7 +57,7 @@ Create a delimited plain text file:
 $ vi /tmp/pxf_hdfs_ts.txt
 ```

-Add the following data to `pxf_hdfs_ts.txt`:
+Copy and paste the following data into `pxf_hdfs_ts.txt`:

 ``` pre
 Prague,Jan,101,4875.33
@@ -66,7 +66,7 @@ Bangalore,May,317,8936.99
 Beijing,Jul,411,11600.67
 ```

-Notice the use of the comma `,` to separate field values.
+Notice the use of the comma `,` to separate the four data fields.

 Add the data file to HDFS:

@@ -86,7 +86,7 @@ Create a second delimited plain text file:
 $ vi /tmp/pxf_hdfs_tm.txt
 ```

-Add the following data to `pxf_hdfs_tm.txt`:
+Copy and paste the following data into `pxf_hdfs_tm.txt`:

 ``` pre
 "4627 Star Rd.
@@ -101,7 +101,7 @@ Chicago, IL 60605":Jul:2017
 Columbus, OH 43213":Dec:2018
 ```

-Notice the use of the colon `:` to separate field values. Also notice the quotes around the first/address field. This field includes an embedded line feed.
+Notice the use of the colon `:` to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line feed.

 Add the data file to HDFS:

@@ -112,7 +112,7 @@ $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_tm.txt /data/pxf_examples/
 You will use these HDFS files in later sections.

 ## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
-The PXF HDFS plug-in supports several HDFS-related profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`.
+The PXF HDFS plug-in supports several profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`.

 Use the following syntax to create a HAWQ external table representing HDFS data:

@@ -126,24 +126,29 @@ FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

 HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the table below.

-**Note**: Some profile-specific options and properties may be discussed in the relevant profile section later in this topic.
-
 | Keyword  | Value |
 |-------|-------------------------------------|
-| host | The HDFS NameNode. |
-| \<path-to-hdfs-file\> | path to the file in the HDFS data store |
-| PROFILE | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable` or `Avro`. |
-| \<custom-option\> | \<custom-option\> is profile-specific. |
-| FORMAT 'TEXT' | Use `TEXT` `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
-| FORMAT 'CSV' | Use `CSV` `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
-| FORMAT 'CUSTOM' | Use the`CUSTOM` `FORMAT` with `Avro` and `SequenceWritable` profiles. The `CUSTOM` format supports only the built-in `formatter='pxfwritable_export'` \<formatting-property\>. |
-| \<formatting-properties\> | \<formatting-properties\> are profile-specific. |
+| \<host\>[:\<port\>] | The HDFS NameNode and port. |
+| \<path-to-hdfs-file\> | The path to the file in the HDFS data store. |
+| PROFILE | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, or `Avro`. |
+| \<custom-option\> | \<custom-option\> is profile-specific. Profile-specific options are discussed in the relevant profile topic later in this section. |
+| FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
+| FORMAT 'CSV' | Use '`CSV`' `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
+| FORMAT 'CUSTOM' | Use the '`CUSTOM`' `FORMAT` with `Avro` and `SequenceWritable` profiles. The '`CUSTOM`' `FORMAT` supports only the built-in `(formatter='pxfwritable_export')` \<formatting-property\>. |
+| \<formatting-properties\> | \<formatting-properties\> are profile-specific. Profile-specific formatting options are discussed in the relevant profile topic later in this section. |
+
+*Note*: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.

-Note: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
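+
+As a sketch, the keywords above map onto a hypothetical `CREATE EXTERNAL TABLE` call as shown below. The NameNode host `namenode`, port `51200`, table name, and file path are illustrative values only; substitute values appropriate for your HDFS cluster:
+
+``` sql
+-- <host>[:<port>]       -> namenode:51200
+-- <path-to-hdfs-file>   -> /data/pxf_examples/pxf_hdfs_ts.txt
+-- PROFILE               -> HdfsTextSimple
+-- <formatting-property> -> delimiter
+CREATE EXTERNAL TABLE pxf_hdfs_anatomy(location text, month text, num_orders int, total_sales float8)
+    LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple')
+FORMAT 'TEXT' (delimiter=E',');
+```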
+## <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
-### <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
+
+Use the `HdfsTextSimple` profile when reading plain text delimited or .csv files where each row is a single record.
+
+\<formatting-properties\> supported by the `HdfsTextSimple` profile include:
+
+| Keyword  | Value |
+|-------|-------------------------------------|
+| delimiter | The delimiter character in the file. Default value is a comma `,`. |
-Use the `HdfsTextSimple` profile when reading plain text delimited or csv files where each row is a single record.

 The following SQL call uses the PXF `HdfsTextSimple` profile to create a queryable HAWQ external table from the `pxf_hdfs_ts.txt` file you created and added to HDFS in an earlier section:

@@ -173,12 +178,18 @@ gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month tex
 gpadmin=# SELECT * FROM pxf_hdfs_textsimple_csv;
 ```

-When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter` formatter option is required, as comma is the default delimiter.
+*Note*: When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter` formatter option is required, as comma is the default.

-### <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile
+## <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile

 Use the `HdfsTextMulti` profile when reading plain text files with delimited single- or multi-line records that include embedded (quoted) linefeed characters.

+\<formatting-properties\> supported by the `HdfsTextMulti` profile include:
+
+| Keyword  | Value |
+|-------|-------------------------------------|
+| delimiter | The delimiter character in the file. |
+
 The following SQL call uses the PXF `HdfsTextMulti` profile to create a queryable HAWQ external table from the `pxf_hdfs_tm.txt` file you created and added to HDFS in an earlier section:

 ``` sql
@@ -204,25 +215,28 @@ gpadmin=# SELECT * FROM pxf_hdfs_textmulti;
 (5 rows)
 ```

-### <a id="profile_hdfsseqwritable"></a>SequenceWritable Profile
+## <a id="profile_hdfsseqwritable"></a>SequenceWritable Profile

 Use the `SequenceWritable` profile when reading SequenceFile format files. Files of this type consist of binary key/value pairs. Sequence files are a common data transfer format between MapReduce jobs.

-The `SequenceWritable` profile supports the following \<custom-options\> or \<formatting-properties\>:
+The `SequenceWritable` profile supports the following \<custom-options\>:

 | Keyword  | Value Description |
 |-------|-------------------------------------|
-| COMPRESSION_CODEC | Specifies the compression codec Java class name |
+| COMPRESSION_CODEC | The compression codec Java class name. |
 | COMPRESSION_TYPE | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
-| \<path-to-hdfs-file\> | path to the file in the HDFS data store |
+| DATA-SCHEMA | The name of the writer serialization class. The jar file in which this class resides must be in the PXF class path. This option has no default value. |
+| THREAD-SAFE | Boolean value determining if a table query can run in multi-thread mode. Default value is `TRUE` - requests can run in multi-thread mode. When set to `FALSE`, requests will be handled in a single thread. |

-MORE HERE
+*TODO*: Add remaining `SequenceWritable` content; address serialization.

-### <a id="profile_hdfsavro"></a>Avro Profile
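+
+Until this section is complete, the following sketch illustrates how these \<custom-options\> combine in a writable `SequenceWritable` table definition. The serialization class `com.example.pxf.SalesRecordWritable` and the file path are hypothetical placeholders; the jar file containing your class must be in the PXF class path:
+
+``` sql
+-- DATA-SCHEMA names the (hypothetical) Writable serialization class;
+-- COMPRESSION_CODEC and COMPRESSION_TYPE request optional block-level gzip compression
+CREATE WRITABLE EXTERNAL TABLE pxf_seq_sales(location text, month text, num_orders int, total_sales float8)
+    LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_seq_sales?PROFILE=SequenceWritable&DATA-SCHEMA=com.example.pxf.SalesRecordWritable&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec&COMPRESSION_TYPE=BLOCK')
+FORMAT 'CUSTOM' (formatter='pxfwritable_export');
+```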
+
+## <a id="profile_hdfsavro"></a>Avro Profile

 Avro files store metadata with the data. Avro files also allow specification of an independent schema used when reading the file.

-#### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping
+### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping

 To represent Avro data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.

@@ -230,8 +244,6 @@ Avro supports complex data types including arrays, maps, records, enumerations,
 The following table summarizes external mapping rules for Avro data.

-<caption><span class="tablecap">Table 2. Avro Data Type Mapping</span></caption>
-
 <a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>

 | Avro Data Type | PXF Type |
@@ -241,25 +253,23 @@ The following table summarizes external mapping rules for Avro data.
 | Complex type: Fixed | BYTEA |
 | Union | Follows the above conventions for primitive or complex data types, depending on the union; supports Null values. |

-#### <a id="profile_hdfsavroptipns"></a>Avro-Specific Formatting Options
+### <a id="profile_hdfsavroptipns"></a>Avro-Specific Custom Options

 For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by identifying values for specific Avro custom options in the `CREATE EXTERNAL TABLE` call.

-The Avro profile supports the following custom options:
-
-<caption><span class="tablecap">Table N. Avro Formatting Options</span></caption>
+The Avro profile supports the following \<custom-options\>:

 | Option Name  | Description |
 |---------------|--------------------|
 | COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma `,` character. |
 | MAPKEY_DELIM | The delimiter character(s) to place between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
-| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text colum. The default is a colon `:` character. |
-| SCHEMA-DATA | The data schema file used to create and read the HDFS file. For an Avro file, you may create an avsc. This option has no default value. |
-| THREAD-SAFE | Determines if the table query can run in multi-thread mode or not. Allowed values are `TRUE`, `FALSE`. Default value is `TRUE` - requests can run in multithread mode. When set to FALSE, requests will be handled in a single thread. |
+| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
+| SCHEMA-DATA | The data schema file used to create and read the HDFS file. This option has no default value. |
+| THREAD-SAFE | Boolean value determining if a table query can run in multi-thread mode. Default value is `TRUE` - requests can run in multi-thread mode. When set to `FALSE`, requests will be handled in a single thread. |

-#### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas
+### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas

-Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data mapping section above. Avro schema file typically have a `.avsc` suffix.
+Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data mapping section above. Avro schema files typically have a `.avsc` suffix.

 Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type. The field type is another schema object
@@ -324,7 +334,7 @@ Copy and paste the following text into `avro_schema.avsc`:

 An Avro schema, together with its data, is fully self-describing.

-#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)

 Create a text file named `pxf_hdfs_avro.txt`:

@@ -342,7 +352,7 @@ Enter the following data into `pxf_hdfs_avro.txt`:

 The sample data uses a comma `,` to separate top level records and a colon `:` to separate map/key values and record field name/values.

-Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line. In this example, we are using the [Java Avro tools](http://avro.apache.org/releases.html) and the jar file resides in the current directory:
+Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line. In this example, we use the [Java Avro tools](http://avro.apache.org/releases.html); the jar file resides in the current directory:

 ``` shell
 $ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
@@ -397,173 +407,14 @@ gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john
 jim | {number:9,street:deer creek,city:palo alto}
 ```

-~~XXXXX
-
-
-### <a id="resolver"></a>Resolver
-
-Choose the Resolver format if data records are serialized in the HDFS file.
-
-**Note:** You must include a Profile or a Resolver in the table definition.
-
-<table>
-<colgroup>
-<col width="33%" />
-<col width="33%" />
-<col width="33%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Record Serialization</th>
-<th>Resolver</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Avro</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.AvroResolver</td>
-<td><ul>
-<li>Avro files include the record schema, Avro serialization can be used in other file types (e.g, Sequence File). </li>
-<li>For Avro serialized records outside of an Avro file, include a schema file name (.avsc) in the url under the optional <code class="ph codeph">Schema-Data </code>option.</li>
-<li>Deserialize Only (Read) .</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>Java Writable</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.WritableResolver</td>
-<td><ul>
-<li>Include the name of the Java class that uses Writable serialization in the URL under the optional <code class="ph codeph">Schema-Data.</code></li>
-<li>The class file must exist in the public stage directory (or in Hadoop's class path).</li>
-<li>Deserialize and Serialize (Read + Write). </li>
-<li>See <a href="#customizedwritableschemafileguidelines">Customized Writable Schema File Guidelines</a>.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>None (plain text)</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</td>
-<td><ul>
-<li>Does not serialize plain text records. The database parses plain records.
Passes records as they are.</li> -<li>Deserialize and Serialize (Read + Write).</li> -</ul></td> -</tr> -</tbody> -</table> - -#### <a id="customizedwritableschemafileguidelines"></a>Schema File Guidelines for WritableResolver - -When using a WritableResolver, a schema file needs to be defined. The file needs to be a Java class file and must be on the class path of PXF. - -The class file must follow the following requirements: - -1. Must implement org.apache.hadoop.io.Writable interface. -2. WritableResolver uses reflection to recreate the schema and populate its fields (for both read and write). Then it uses the Writable interface functions to read/write. Therefore, fields must be public, to enable access to them. Private fields will be ignored. -3. Fields are accessed and populated in the order in which they are declared in the class file. -4. Supported field types: - - boolean - - byte array - - double - - float - - int - - long - - short - - string - - Arrays of any of the above types are supported, but the constructor must define the array size so the reflection will work. - -### <a id="additionaloptions"></a>Additional Options - -<a id="additionaloptions__table_skq_kpz_4p"></a> - -<table> -<caption><span class="tablecap">Table 1. Additional PXF Options</span></caption> -<colgroup> -<col width="50%" /> -<col width="50%" /> -</colgroup> -<thead> -<tr class="header"> -<th>Option Name</th> -<th>Description</th> -</tr> -</thead> -<tbody> -<tr class="odd"> -<td>COLLECTION_DELIM</td> -<td>(Avro or Hive profiles only.) The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps a Hive or Avro complex data type to a text column. The default is a "," character.</td> -</tr> -<tr class="even"> -<td>COMPRESSION_CODEC</td> -<td><ul> -<li>Useful for WRITABLE PXF tables.</li> -<li>Specifies the compression codec class name for compressing the written data. The class must implement the org.apache.hadoop.io.compress.CompressionCodec interface.</li> -<li> Some valid values are org.apache.hadoop.io.compress.DefaultCodec org.apache.hadoop.io.compress.GzipCodec org.apache.hadoop.io.compress.BZip2Codec.</li> -<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread and can be slow.</li> -<li>This option has no default value. </li> -<li>When the option is not defined, no compression will be done.</li> -</ul></td> -</tr> -<tr class="odd"> -<td>COMPRESSION_TYPE</td> -<td><ul> -<li>Useful WRITABLE PXF tables with SequenceFileAccessor.</li> -<li>Ignored when COMPRESSION_CODEC is not defined.</li> -<li>Specifies the compression type for sequence file.</li> -<li>Valid options are: -<ul> -<li>RECORD - only the value part of each row is compressed.</li> -<li>BLOCK - both keys and values are collected in 'blocks' separately and compressed.</li> -</ul></li> -<li>Default value: RECORD.</li> -</ul></td> -</tr> -<tr class="even"> -<td>MAPKEY_DELIM</td> -<td>(Avro or Hive profiles only.) The delimiter character(s) to place between the key and value of a map entry when PXF maps a Hive or Avro complex data type to a text colum. The default is a ":" character.</td> -</tr> -<tr class="odd"> -<td>RECORDKEY_DELIM</td> -<td>(Avro profile only.) The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text colum. The default is a ":" character.</td> -</tr> -<tr class="even"> -<td>SCHEMA-DATA</td> -<td>The data schema file used to create and read the HDFS file. 
For example, you could create an avsc (for Avro), or a Java class (for Writable Serialization) file. Make sure that you have added any JAR files containing the schema to <code class="ph codeph">pxf-public.classpath</code>.
-<p>This option has no default value.</p></td>
-</tr>
-<tr class="odd">
-<td>THREAD-SAFE</td>
-<td>Determines if the table query can run in multithread mode or not. When set to FALSE, requests will be handled in a single thread.
-<p>Should be set when a plug-in or other elements that are not thread safe are used (e.g. compression codec).</p>
-<p>Allowed values: TRUE, FALSE. Default value is TRUE - requests can run in multithread mode.</p></td>
-</tr>
-<tr class="even">
-<td> <custom></td>
-<td>Any option added to the pxf URI string will be accepted and passed, along with its value, to the Fragmenter, Accessor, and Resolver implementations.</td>
-</tr>
-</tbody>
-</table>
-
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing HDFS Data in a High Availability HDFS Cluster
-
-To access data in a High Availability HDFS cluster, change the \<host\> provided in the URI LOCATION clause. Use *HA\_nameservice* rather than *name\_node\_host:51200*.

-``` sql
-CREATE EXTERNAL TABLE <table_name>
-    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
-LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>
-    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
-FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
-```
+## <a id="recordkeyinkey-valuefileformats"></a>Reading the Record Key

-The opposite is true when a highly available HDFS cluster is reverted to a single namenode configuration. In that case, any table definition that has the \<HA-nameservice\> specified should use the \<host\>[:\<port\>] syntax.
+For sequence files and other file formats that store rows in a key-value format, you can access the key value through HAWQ by using the `recordkey` keyword as a field name.

-## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with Key-Value File Formats
+The field type of `recordkey` must correspond to the key type, much as the other fields must match the HDFS data.

-For sequence file and other file formats that store rows in a key-value format, the key value can be accessed through HAWQ by using the saved keyword '`recordkey`' as a field name.
+`recordkey` can be of the following Hadoop types:

-The field type must correspond to the key type, much as the other fields must match the HDFS data.

-WritableResolver supports read and write of recordkey, which can be of the following Writable Hadoop types:

 - BooleanWritable
 - ByteWritable
 - DoubleWritable
 - FloatWritable
 - IntWritable
 - LongWritable
 - Text

-If the `recordkey` field is not defined, the key is ignored in read, and a default value (segment id as LongWritable) is written in write.
-
 ### <a id="example1"></a>Example

-A data schema `Babies.class` contains three fields: (name text, birthday text, weight float). An external table must include these three fields, and can either include or ignore the recordkey.
+A data schema `Babies.class` contains three fields: name (text), birthday (text), weight (float). An external table definition for this schema must include these three fields, and can either include or ignore the `recordkey`.
``` sql --- writable table with recordkey -CREATE WRITABLE EXTERNAL TABLE babies_registry (recordkey int, name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_export'); -INSERT INTO babies_registry VALUES (123456, "James Paul McCartney", "June 18, 1942", 3.800); - --- writable table without recordkey -CREATE WRITABLE EXTERNAL TABLE babies_registry2 (name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?ACCESSOR=org.apache.hawq.pxf.plugins.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_export'); - --- this record's key will have some default value -INSERT INTO babies_registry VALUES ("Richard Starkey", "July 7, 1940", 4.0); +gpadmin=# CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float) + LOCATION ('pxf://namenode:51200/babies_1940s?PROFILE=SequenceWritable&DATA-SCHEMA=Babies') + FORMAT 'CUSTOM' (formatter='pxfwritable_import'); +gpadmin=# SELECT * FROM babies_1940; ``` -The same goes for reading data from an existing file with a key-value format, e.g. a Sequence file. +## <a id="accessdataonahavhdfscluster"></a>Accessing HDFS Data in a High Availability HDFS Cluster + +To access external HDFS data in a High Availability HDFS cluster, change the URI LOCATION clause to use \<HA-nameservice\> rather than \<host\>[:\<port\>]. ``` sql --- readable table with recordkey -CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter' - '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_import'); --- retrieve each record's key -SELECT * FROM babies_1940; - --- readable table without recordkey -CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter' - '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_import'); --- ignores the records' key -SELECT * FROM babies_1940_2; +gpadmin=# CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> ) + LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]') + FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>); ``` +The opposite is true when a highly available HDFS cluster is reverted to a single NameNode configuration. In that case, any table definition that has the \<HA-nameservice\> specified should use the \<host\>[:\<port\>] syntax. + + ## <a id="hdfs_advanced"></a>Advanced -If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. 
Refer to [XX]() for information on \ No newline at end of file +If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. Refer to [Adding and Updating Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on creating a custom profile. \ No newline at end of file
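+
+For example, once a custom profile is registered, you reference it in `CREATE EXTERNAL TABLE` exactly as you would a built-in profile. The profile name `MyCustomProfile` and the file path below are hypothetical placeholders:
+
+``` sql
+-- assumes a custom readable profile named MyCustomProfile has been defined
+-- per the Adding and Updating Profiles documentation
+CREATE EXTERNAL TABLE pxf_custom_example(data text)
+    LOCATION ('pxf://namenode:51200/data/pxf_examples/custom_data?PROFILE=MyCustomProfile')
+FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+```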
