more content and rearranging of pxf hdfs plugin page
Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/2da7a92a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/2da7a92a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/2da7a92a

Branch: refs/heads/develop
Commit: 2da7a92a3e8431335a48005d55a70c9eba333e16
Parents: 9ca2779
Author: Lisa Owen <[email protected]>
Authored: Mon Oct 17 16:27:23 2016 -0700
Committer: Lisa Owen <[email protected]>
Committed: Mon Oct 17 16:27:23 2016 -0700

----------------------------------------------------------------------
 pxf/HDFSFileDataPXF.html.md.erb | 331 ++++++++---------------------------
 1 file changed, 76 insertions(+), 255 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/2da7a92a/pxf/HDFSFileDataPXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HDFSFileDataPXF.html.md.erb b/pxf/HDFSFileDataPXF.html.md.erb
index e1c621f..e49688e 100644
--- a/pxf/HDFSFileDataPXF.html.md.erb
+++ b/pxf/HDFSFileDataPXF.html.md.erb
@@ -17,27 +17,27 @@ Before working with HDFS file data using HAWQ and PXF, ensure that:

 The PXF HDFS plug-in supports the following file formats:

-- TextFile - comma-separated value or delimited format plain text file
+- TextFile - comma-separated value (.csv) or delimited format plain text file
 - SequenceFile - flat file consisting of binary key/value pairs
 - Avro - JSON-defined, schema-based data serialization format

 The PXF HDFS plug-in includes the following profiles to support the file formats listed above:

-- `HdfsTextSimple`
-- `HdfsTextMulti`
-- `SequenceWritable`
-- `Avro`
+- `HdfsTextSimple` - text files
+- `HdfsTextMulti` - text files with embedded line feeds
+- `SequenceWritable` - SequenceFile
+- `Avro` - Avro files

 ## <a id="hdfsplugin_datatypemap"></a>Data Type Mapping

 *TODO*: Add data type mapping content.

 ## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands

-HAWQ includes command-line tools that interact directly with HDFS. These tools support typical file system operations including copying, listing, changing file permissions, etc.
+Hadoop includes command-line tools that interact directly with HDFS. These tools support typical file system operations including copying and listing files, changing file permissions, etc.

 The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked with no options, `hdfs dfs` lists the file system options supported by the tool.

-`hdfs dfs` options used in this section are listed in the table below:
+`hdfs dfs` options used in this section are identified in the table below:

 | Option  | Description |
 |-------|-------------------------------------|
@@ -57,7 +57,7 @@ Create a delimited plain text file:
 $ vi /tmp/pxf_hdfs_ts.txt
 ```

-Add the following data to `pxf_hdfs_ts.txt`:
+Copy and paste the following data into `pxf_hdfs_ts.txt`:

 ``` pre
 Prague,Jan,101,4875.33
@@ -66,7 +66,7 @@ Bangalore,May,317,8936.99
 Beijing,Jul,411,11600.67
 ```

-Notice the use of the comma `,` to separate field values.
+Notice the use of the comma `,` to separate the four data fields.

 Add the data file to HDFS:

@@ -86,7 +86,7 @@ Create a second delimited plain text file:
 $ vi /tmp/pxf_hdfs_tm.txt
 ```

-Add the following data to `pxf_hdfs_tm.txt`:
+Copy and paste the following data into `pxf_hdfs_tm.txt`:

 ``` pre
 "4627 Star Rd.
@@ -101,7 +101,7 @@ Chicago, IL 60605":Jul:2017
 Columbus, OH 43213":Dec:2018
 ```

-Notice the use of the colon `:` to separate field values. Also notice the quotes around the first/address field. This field includes an embedded line feed.
+Notice the use of the colon `:` to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line feed.

 Add the data file to HDFS:

@@ -112,7 +112,7 @@ $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_tm.txt /data/pxf_examples/
 You will use these HDFS files in later sections.

 ## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
-The PXF HDFS plug-in supports several HDFS-related profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`.
+The PXF HDFS plug-in supports several profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`.

 Use the following syntax to create a HAWQ external table representing HDFS data:

@@ -126,24 +126,29 @@ FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

 HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the table below.

-**Note**: Some profile-specific options and properties may be discussed in the relevant profile section later in this topic.
-
 | Keyword  | Value |
 |-------|-------------------------------------|
-| host | The HDFS NameNode. |
-| \<path-to-hdfs-file\> | path to the file in the HDFS data store |
-| PROFILE | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable` or `Avro`. |
-| \<custom-option\> | \<custom-option\> is profile-specific. |
-| FORMAT 'TEXT' | Use `TEXT` `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
-| FORMAT 'CSV' | Use `CSV` `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
-| FORMAT 'CUSTOM' | Use the`CUSTOM` `FORMAT` with `Avro` and `SequenceWritable` profiles. The `CUSTOM` format supports only the built-in `formatter='pxfwritable_export'` \<formatting-property\>. |
-| \<formatting-properties\> | \<formatting-properties\> are profile-specific. |
+| \<host\>[:\<port\>] | The HDFS NameNode and port. |
+| \<path-to-hdfs-file\> | The path to the file in the HDFS data store. |
+| PROFILE | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, or `Avro`. |
+| \<custom-option\> | \<custom-option\> is profile-specific. Profile-specific options are discussed in the relevant profile topic later in this section. |
+| FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
+| FORMAT 'CSV' | Use '`CSV`' `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
+| FORMAT 'CUSTOM' | Use the '`CUSTOM`' `FORMAT` with `Avro` and `SequenceWritable` profiles. The '`CUSTOM`' `FORMAT` supports only the built-in `(formatter='pxfwritable_export')` \<formatting-property\>. |
+| \<formatting-properties\> | \<formatting-properties\> are profile-specific. Profile-specific formatting options are discussed in the relevant profile topic later in this section. |
+
+*Note*: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.

-Note: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
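+
+As a sketch, the keywords above map onto a hypothetical `CREATE EXTERNAL TABLE` call as shown below. The NameNode host `namenode`, port `51200`, table name, and file path are illustrative values only; substitute values appropriate for your HDFS cluster:
+
+``` sql
+-- <host>[:<port>]       -> namenode:51200
+-- <path-to-hdfs-file>   -> /data/pxf_examples/pxf_hdfs_ts.txt
+-- PROFILE               -> HdfsTextSimple
+-- <formatting-property> -> delimiter
+CREATE EXTERNAL TABLE pxf_hdfs_anatomy(location text, month text, num_orders int, total_sales float8)
+    LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple')
+FORMAT 'TEXT' (delimiter=E',');
+```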
+## <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
-### <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
+
+Use the `HdfsTextSimple` profile when reading plain text delimited or .csv files where each row is a single record.
+
+\<formatting-properties\> supported by the `HdfsTextSimple` profile include:
+
+| Keyword  | Value |
+|-------|-------------------------------------|
+| delimiter | The delimiter character in the file. Default value is a comma `,`. |
-Use the `HdfsTextSimple` profile when reading plain text delimited or csv files where each row is a single record.

 The following SQL call uses the PXF `HdfsTextSimple` profile to create a queryable HAWQ external table from the `pxf_hdfs_ts.txt` file you created and added to HDFS in an earlier section:

@@ -173,12 +178,18 @@ gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month tex
 gpadmin=# SELECT * FROM pxf_hdfs_textsimple_csv;
 ```

-When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter` formatter option is required, as comma is the default delimiter.
+*Note*: When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter` formatter option is required, as comma is the default.

-### <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile
+## <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile

 Use the `HdfsTextMulti` profile when reading plain text files with delimited single- or multi-line records that include embedded (quoted) linefeed characters.

+\<formatting-properties\> supported by the `HdfsTextMulti` profile include:
+
+| Keyword  | Value |
+|-------|-------------------------------------|
+| delimiter | The delimiter character in the file. |
+
 The following SQL call uses the PXF `HdfsTextMulti` profile to create a queryable HAWQ external table from the `pxf_hdfs_tm.txt` file you created and added to HDFS in an earlier section:

 ``` sql
@@ -204,25 +215,28 @@ gpadmin=# SELECT * FROM pxf_hdfs_textmulti;
 (5 rows)
 ```

-### <a id="profile_hdfsseqwritable"></a>SequenceWritable Profile
+## <a id="profile_hdfsseqwritable"></a>SequenceWritable Profile

 Use the `SequenceWritable` profile when reading SequenceFile format files. Files of this type consist of binary key/value pairs. Sequence files are a common data transfer format between MapReduce jobs.

-The `SequenceWritable` profile supports the following \<custom-options\> or \<formatting-properties\>:
+The `SequenceWritable` profile supports the following \<custom-options\>:

 | Keyword  | Value Description |
 |-------|-------------------------------------|
-| COMPRESSION_CODEC | Specifies the compression codec Java class name |
+| COMPRESSION_CODEC | The compression codec Java class name. |
 | COMPRESSION_TYPE | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
-| \<path-to-hdfs-file\> | path to the file in the HDFS data store |
+| DATA-SCHEMA | The name of the writer serialization class. The jar file in which this class resides must be in the PXF class path. This option has no default value. |
+| THREAD-SAFE | Boolean value determining if a table query can run in multi-thread mode. Default value is `TRUE` - requests can run in multi-thread mode. When set to `FALSE`, requests will be handled in a single thread. |

-MORE HERE
+*TODO*: Add remaining `SequenceWritable` content; address serialization.

-### <a id="profile_hdfsavro"></a>Avro Profile
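+
+Until this section is complete, the following sketch illustrates how these \<custom-options\> combine in a writable `SequenceWritable` table definition. The serialization class `com.example.pxf.SalesRecordWritable` and the file path are hypothetical placeholders; the jar file containing your class must be in the PXF class path:
+
+``` sql
+-- DATA-SCHEMA names the (hypothetical) Writable serialization class;
+-- COMPRESSION_CODEC and COMPRESSION_TYPE request optional block-level gzip compression
+CREATE WRITABLE EXTERNAL TABLE pxf_seq_sales(location text, month text, num_orders int, total_sales float8)
+    LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_seq_sales?PROFILE=SequenceWritable&DATA-SCHEMA=com.example.pxf.SalesRecordWritable&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec&COMPRESSION_TYPE=BLOCK')
+FORMAT 'CUSTOM' (formatter='pxfwritable_export');
+```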
+
+## <a id="profile_hdfsavro"></a>Avro Profile

 Avro files store metadata with the data. Avro files also allow specification of an independent schema used when reading the file.

-#### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping
+### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping

 To represent Avro data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.

@@ -230,8 +244,6 @@ Avro supports complex data types including arrays, maps, records, enumerations,
 The following table summarizes external mapping rules for Avro data.

-<caption><span class="tablecap">Table 2. Avro Data Type Mapping</span></caption>
-
 <a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>

 | Avro Data Type | PXF Type |
@@ -241,25 +253,23 @@ The following table summarizes external mapping rules for Avro data.
 | Complex type: Fixed | BYTEA |
 | Union | Follows the above conventions for primitive or complex data types, depending on the union; supports Null values. |

-#### <a id="profile_hdfsavroptipns"></a>Avro-Specific Formatting Options
+### <a id="profile_hdfsavroptipns"></a>Avro-Specific Custom Options

 For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by identifying values for specific Avro custom options in the `CREATE EXTERNAL TABLE` call.

-The Avro profile supports the following custom options:
-
-<caption><span class="tablecap">Table N. Avro Formatting Options</span></caption>
+The Avro profile supports the following \<custom-options\>:

 | Option Name  | Description |
 |---------------|--------------------|
 | COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma `,` character. |
 | MAPKEY_DELIM | The delimiter character(s) to place between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
-| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text colum. The default is a colon `:` character. |
-| SCHEMA-DATA | The data schema file used to create and read the HDFS file. For an Avro file, you may create an avsc. This option has no default value. |
-| THREAD-SAFE | Determines if the table query can run in multi-thread mode or not. Allowed values are `TRUE`, `FALSE`. Default value is `TRUE` - requests can run in multithread mode. When set to FALSE, requests will be handled in a single thread. |
+| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
+| SCHEMA-DATA | The data schema file used to create and read the HDFS file. This option has no default value. |
+| THREAD-SAFE | Boolean value determining if a table query can run in multi-thread mode. Default value is `TRUE` - requests can run in multi-thread mode. When set to `FALSE`, requests will be handled in a single thread. |

-#### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas
+### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas

-Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data mapping section above. Avro schema file typically have a `.avsc` suffix.
+Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data mapping section above. Avro schema files typically have a `.avsc` suffix.

 Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type. The field type is another schema object
@@ -324,7 +334,7 @@ Copy and paste the following text into `avro_schema.avsc`:

 An Avro schema, together with its data, is fully self-describing.

-#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)

 Create a text file named `pxf_hdfs_avro.txt`:

@@ -342,7 +352,7 @@ Enter the following data into `pxf_hdfs_avro.txt`:

 The sample data uses a comma `,` to separate top level records and a colon `:` to separate map/key values and record field name/values.

-Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line. In this example, we are using the [Java Avro tools](http://avro.apache.org/releases.html) and the jar file resides in the current directory:
+Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line. In this example, we use the [Java Avro tools](http://avro.apache.org/releases.html); the jar file resides in the current directory:

 ``` shell
 $ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
@@ -397,173 +407,14 @@ gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john
 jim | {number:9,street:deer creek,city:palo alto}
 ```

-~~XXXXX
-
-
-### <a id="resolver"></a>Resolver
-
-Choose the Resolver format if data records are serialized in the HDFS file.
-
-**Note:** You must include a Profile or a Resolver in the table definition.
-
-<table>
-<colgroup>
-<col width="33%" />
-<col width="33%" />
-<col width="33%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Record Serialization</th>
-<th>Resolver</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Avro</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.AvroResolver</td>
-<td><ul>
-<li>Avro files include the record schema, Avro serialization can be used in other file types (e.g, Sequence File). </li>
-<li>For Avro serialized records outside of an Avro file, include a schema file name (.avsc) in the url under the optional <code class="ph codeph">Schema-Data </code>option.</li>
-<li>Deserialize Only (Read) .</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>Java Writable</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.WritableResolver</td>
-<td><ul>
-<li>Include the name of the Java class that uses Writable serialization in the URL under the optional <code class="ph codeph">Schema-Data.</code></li>
-<li>The class file must exist in the public stage directory (or in Hadoop's class path).</li>
-<li>Deserialize and Serialize (Read + Write). </li>
-<li>See <a href="#customizedwritableschemafileguidelines">Customized Writable Schema File Guidelines</a>.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>None (plain text)</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</td>
-<td><ul>
-<li>Does not serialize plain text records. The database parses plain records.
Passes records as they are.</li> -<li>Deserialize and Serialize (Read + Write).</li> -</ul></td> -</tr> -</tbody> -</table> - -#### <a id="customizedwritableschemafileguidelines"></a>Schema File Guidelines for WritableResolver - -When using a WritableResolver, a schema file needs to be defined. The file needs to be a Java class file and must be on the class path of PXF. - -The class file must follow the following requirements: - -1. Must implement org.apache.hadoop.io.Writable interface. -2. WritableResolver uses reflection to recreate the schema and populate its fields (for both read and write). Then it uses the Writable interface functions to read/write. Therefore, fields must be public, to enable access to them. Private fields will be ignored. -3. Fields are accessed and populated in the order in which they are declared in the class file. -4. Supported field types: - - boolean - - byte array - - double - - float - - int - - long - - short - - string - - Arrays of any of the above types are supported, but the constructor must define the array size so the reflection will work. - -### <a id="additionaloptions"></a>Additional Options - -<a id="additionaloptions__table_skq_kpz_4p"></a> - -<table> -<caption><span class="tablecap">Table 1. Additional PXF Options</span></caption> -<colgroup> -<col width="50%" /> -<col width="50%" /> -</colgroup> -<thead> -<tr class="header"> -<th>Option Name</th> -<th>Description</th> -</tr> -</thead> -<tbody> -<tr class="odd"> -<td>COLLECTION_DELIM</td> -<td>(Avro or Hive profiles only.) The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps a Hive or Avro complex data type to a text column. The default is a "," character.</td> -</tr> -<tr class="even"> -<td>COMPRESSION_CODEC</td> -<td><ul> -<li>Useful for WRITABLE PXF tables.</li> -<li>Specifies the compression codec class name for compressing the written data. The class must implement the org.apache.hadoop.io.compress.CompressionCodec interface.</li> -<li> Some valid values are org.apache.hadoop.io.compress.DefaultCodec org.apache.hadoop.io.compress.GzipCodec org.apache.hadoop.io.compress.BZip2Codec.</li> -<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread and can be slow.</li> -<li>This option has no default value. </li> -<li>When the option is not defined, no compression will be done.</li> -</ul></td> -</tr> -<tr class="odd"> -<td>COMPRESSION_TYPE</td> -<td><ul> -<li>Useful WRITABLE PXF tables with SequenceFileAccessor.</li> -<li>Ignored when COMPRESSION_CODEC is not defined.</li> -<li>Specifies the compression type for sequence file.</li> -<li>Valid options are: -<ul> -<li>RECORD - only the value part of each row is compressed.</li> -<li>BLOCK - both keys and values are collected in 'blocks' separately and compressed.</li> -</ul></li> -<li>Default value: RECORD.</li> -</ul></td> -</tr> -<tr class="even"> -<td>MAPKEY_DELIM</td> -<td>(Avro or Hive profiles only.) The delimiter character(s) to place between the key and value of a map entry when PXF maps a Hive or Avro complex data type to a text colum. The default is a ":" character.</td> -</tr> -<tr class="odd"> -<td>RECORDKEY_DELIM</td> -<td>(Avro profile only.) The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text colum. The default is a ":" character.</td> -</tr> -<tr class="even"> -<td>SCHEMA-DATA</td> -<td>The data schema file used to create and read the HDFS file. 
For example, you could create an avsc (for Avro), or a Java class (for Writable Serialization) file. Make sure that you have added any JAR files containing the schema to <code class="ph codeph">pxf-public.classpath</code>.
-<p>This option has no default value.</p></td>
-</tr>
-<tr class="odd">
-<td>THREAD-SAFE</td>
-<td>Determines if the table query can run in multithread mode or not. When set to FALSE, requests will be handled in a single thread.
-<p>Should be set when a plug-in or other elements that are not thread safe are used (e.g. compression codec).</p>
-<p>Allowed values: TRUE, FALSE. Default value is TRUE - requests can run in multithread mode.</p></td>
-</tr>
-<tr class="even">
-<td> <custom></td>
-<td>Any option added to the pxf URI string will be accepted and passed, along with its value, to the Fragmenter, Accessor, and Resolver implementations.</td>
-</tr>
-</tbody>
-</table>
-
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing HDFS Data in a High Availability HDFS Cluster
-
-To access data in a High Availability HDFS cluster, change the \<host\> provided in the URI LOCATION clause. Use *HA\_nameservice* rather than *name\_node\_host:51200*.

-``` sql
-CREATE EXTERNAL TABLE <table_name>
-    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
-LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>
-    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
-FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
-```
+## <a id="recordkeyinkey-valuefileformats"></a>Reading the Record Key

-The opposite is true when a highly available HDFS cluster is reverted to a single namenode configuration. In that case, any table definition that has the \<HA-nameservice\> specified should use the \<host\>[:\<port\>] syntax.
+For sequence files and other file formats that store rows in a key-value format, you can access the key value through HAWQ by using the `recordkey` keyword as a field name.

-## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with Key-Value File Formats
+The field type of `recordkey` must correspond to the key type, much as the other fields must match the HDFS data.

-For sequence file and other file formats that store rows in a key-value format, the key value can be accessed through HAWQ by using the saved keyword '`recordkey`' as a field name.
+`recordkey` can be of the following Hadoop types:

-The field type must correspond to the key type, much as the other fields must match the HDFS data.

-WritableResolver supports read and write of recordkey, which can be of the following Writable Hadoop types:

 - BooleanWritable
 - ByteWritable
 - DoubleWritable
 - FloatWritable
 - IntWritable
 - LongWritable
 - Text

-If the `recordkey` field is not defined, the key is ignored in read, and a default value (segment id as LongWritable) is written in write.
-
 ### <a id="example1"></a>Example

-A data schema `Babies.class` contains three fields: (name text, birthday text, weight float). An external table must include these three fields, and can either include or ignore the recordkey.
+A data schema `Babies.class` contains three fields: name (text), birthday (text), weight (float). An external table definition for this schema must include these three fields, and can either include or ignore the `recordkey`.
``` sql --- writable table with recordkey -CREATE WRITABLE EXTERNAL TABLE babies_registry (recordkey int, name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_export'); -INSERT INTO babies_registry VALUES (123456, "James Paul McCartney", "June 18, 1942", 3.800); - --- writable table without recordkey -CREATE WRITABLE EXTERNAL TABLE babies_registry2 (name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?ACCESSOR=org.apache.hawq.pxf.plugins.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_export'); - --- this record's key will have some default value -INSERT INTO babies_registry VALUES ("Richard Starkey", "July 7, 1940", 4.0); +gpadmin=# CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float) + LOCATION ('pxf://namenode:51200/babies_1940s?PROFILE=SequenceWritable&DATA-SCHEMA=Babies') + FORMAT 'CUSTOM' (formatter='pxfwritable_import'); +gpadmin=# SELECT * FROM babies_1940; ``` -The same goes for reading data from an existing file with a key-value format, e.g. a Sequence file. +## <a id="accessdataonahavhdfscluster"></a>Accessing HDFS Data in a High Availability HDFS Cluster + +To access external HDFS data in a High Availability HDFS cluster, change the URI LOCATION clause to use \<HA-nameservice\> rather than \<host\>[:\<port\>]. ``` sql --- readable table with recordkey -CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter' - '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_import'); --- retrieve each record's key -SELECT * FROM babies_1940; - --- readable table without recordkey -CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight float) - LOCATION ('pxf://namenode_host:51200/babies_1940s' - '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter' - '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor' - '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver' - '&DATA-SCHEMA=Babies') - FORMAT 'CUSTOM' (formatter='pxfwritable_import'); --- ignores the records' key -SELECT * FROM babies_1940_2; +gpadmin=# CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> ) + LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]') + FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>); ``` +The opposite is true when a highly available HDFS cluster is reverted to a single NameNode configuration. In that case, any table definition that has the \<HA-nameservice\> specified should use the \<host\>[:\<port\>] syntax. + + ## <a id="hdfs_advanced"></a>Advanced -If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. 
Refer to [XX]() for information on \ No newline at end of file +If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. Refer to [Adding and Updating Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on creating a custom profile. \ No newline at end of file
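+
+For example, once a custom profile is registered, you reference it in `CREATE EXTERNAL TABLE` exactly as you would a built-in profile. The profile name `MyCustomProfile` and the file path below are hypothetical placeholders:
+
+``` sql
+-- assumes a custom readable profile named MyCustomProfile has been defined
+-- per the Adding and Updating Profiles documentation
+CREATE EXTERNAL TABLE pxf_custom_example(data text)
+    LOCATION ('pxf://namenode:51200/data/pxf_examples/custom_data?PROFILE=MyCustomProfile')
+FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+```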
