Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop f335de127 -> 5673447e0


start restructuring HDFS plug-in page


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: 
http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/9ca27792
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/9ca27792
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/9ca27792

Branch: refs/heads/develop
Commit: 9ca277927bebd9c8d79bdf4619dfaf94a695c838
Parents: a819abd
Author: Lisa Owen <[email protected]>
Authored: Fri Oct 14 15:29:22 2016 -0700
Committer: Lisa Owen <[email protected]>
Committed: Fri Oct 14 15:29:22 2016 -0700

----------------------------------------------------------------------
 pxf/HDFSFileDataPXF.html.md.erb | 622 +++++++++++++++++++++--------------
 1 file changed, 373 insertions(+), 249 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/9ca27792/pxf/HDFSFileDataPXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HDFSFileDataPXF.html.md.erb b/pxf/HDFSFileDataPXF.html.md.erb
index 99c27ba..e1c621f 100644
--- a/pxf/HDFSFileDataPXF.html.md.erb
+++ b/pxf/HDFSFileDataPXF.html.md.erb
@@ -2,134 +2,403 @@
 title: Accessing HDFS File Data
 ---
 
-## <a id="installingthepxfhdfsplugin"></a>Prerequisites
+HDFS is the primary distributed storage mechanism used by Apache Hadoop applications. The PXF HDFS plug-in reads file data stored in HDFS. The plug-in supports plain delimited and comma-separated-value text files, as well as the Avro and SequenceFile binary formats.
 
-Before working with HDFS file data using HAWQ and PXF, you should perform the 
following operations:
+This section describes how to use PXF to access HDFS data, including how to 
create and query an external table from files in the HDFS data store.
 
--   Test PXF on HDFS before connecting to Hive or HBase.
--   Ensure that all HDFS users have read permissions to HDFS services and 
that write permissions have been limited to specific users.
+## <a id="hdfsplugin_prereq"></a>Prerequisites
 
-## <a id="syntax1"></a>Syntax
+Before working with HDFS file data using HAWQ and PXF, ensure that:
 
-The syntax for creating an external HDFS file is as follows: 
+-   The HDFS plug-in is installed on all cluster nodes.
+-   All HDFS users have read permissions to HDFS services, and write permissions are restricted to specific users.
 
-``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
-    ( column_name data_type [, ...] | LIKE other_table )
-LOCATION ('pxf://host[:port]/path-to-data?<pxf 
parameters>[&custom-option=value...]')
-      FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
+## <a id="hdfsplugin_fileformats"></a>HDFS File Formats
+
+The PXF HDFS plug-in supports the following file formats:
+
+- TextFile - comma-separated value or delimited format plain text file
+- SequenceFile - flat file consisting of binary key/value pairs
+- Avro - JSON-defined, schema-based data serialization format
+
+The PXF HDFS plug-in includes the following profiles to support the file 
formats listed above:
+
+- `HdfsTextSimple`
+- `HdfsTextMulti`
+- `SequenceWritable`
+- `Avro`
+
+## <a id="hdfsplugin_datatypemap"></a>Data Type Mapping
+
+
+## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
+HAWQ includes command-line tools that interact directly with HDFS. These tools support typical file system operations, including copying files, listing directory contents, and changing file permissions.
+
+The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked with no 
options, `hdfs dfs` lists the file system options supported by the tool.
+
+The `hdfs dfs` options used in this section are listed in the table below:
+
+| Option  | Description |
+|-------|-------------------------------------|
+| `-cat`    | Display file contents |
+| `-mkdir`    | Create directory in HDFS |
+| `-put`    | Copy file from local file system to HDFS |
+
+Create an HDFS directory for PXF example data files:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples
 ```
 
-where `<pxf parameters>` is:
+Create a delimited plain text file:
+
+``` shell
+$ vi /tmp/pxf_hdfs_ts.txt
+```
+
+Add the following data to `pxf_hdfs_ts.txt`:
 
 ``` pre
-   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
- | PROFILE=profile-name
+Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67
+```
+
+Notice the use of the comma `,` to separate field values.
+
+Add the data file to HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_ts.txt /data/pxf_examples/
+```
+
+Display the contents of `pxf_hdfs_ts.txt` stored in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -cat /data/pxf_examples/pxf_hdfs_ts.txt
+```
+
+Create a second delimited plain text file:
+
+``` shell
+$ vi /tmp/pxf_hdfs_tm.txt
+```
+
+Add the following data to `pxf_hdfs_tm.txt`:
+
+``` pre
+"4627 Star Rd.
+San Francisco, CA  94107":Sept:2017
+"113 Moon St.
+San Diego, CA  92093":Jan:2018
+"51 Belt Ct.
+Denver, CO  90123":Dec:2016
+"93114 Radial Rd.
+Chicago, IL  60605":Jul:2017
+"7301 Brookview Ave.
+Columbus, OH  43213":Dec:2018
+```
+
+Notice the use of the colon `:` to separate field values. Also notice the quotes around the first (address) field; this field includes an embedded line feed.
+
+Add the data file to HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_tm.txt /data/pxf_examples/
 ```
 
-**Note:** Omit the `FRAGMENTER` parameter for `READABLE` external tables.
+You will use these HDFS files in later sections.
+
+## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
+The PXF HDFS plug-in supports several HDFS-related profiles. These include 
`HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`.
 
-Use an SQL `SELECT` statement to read from an HDFS READABLE table:
+Use the following syntax to create a HAWQ external table representing HDFS 
data: 
 
 ``` sql
-SELECT ... FROM table_name;
+CREATE EXTERNAL TABLE <table_name> 
+    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
+    
?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
+FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
 ```
 
-Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
+HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL 
TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the 
table below.
+
+**Note**: Some profile-specific options and properties may be discussed in the 
relevant profile section later in this topic.
+
+| Keyword  | Value |
+|-------|-------------------------------------|
+| \<host\>    | The HDFS NameNode host. |
+| \<path-to-hdfs-file\>    | The path to the file in the HDFS data store. |
+| PROFILE    | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, or `Avro`. |
+| \<custom-option\>  | \<custom-option\> is profile-specific. |
+| FORMAT 'TEXT' | Use `TEXT` `FORMAT` with the `HdfsTextSimple` profile when 
\<path-to-hdfs-file\> references a plain text delimited file.  |
+| FORMAT 'CSV' | Use `CSV` `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` 
profiles when \<path-to-hdfs-file\> references a comma-separated value file.  |
+| FORMAT 'CUSTOM' | Use the `CUSTOM` `FORMAT` with the `Avro` and `SequenceWritable` profiles. The `CUSTOM` format supports only the built-in `formatter='pxfwritable_import'` (read) and `formatter='pxfwritable_export'` (write) \<formatting-properties\>.  |
+| \<formatting-properties\> | \<formatting-properties\> are profile-specific. |
+
+**Note**: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
+
+### <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
+
+Use the `HdfsTextSimple` profile when reading plain text delimited or CSV files where each row is a single record.
+
+The following SQL call uses the PXF `HdfsTextSimple` profile to create a 
queryable HAWQ external table from the `pxf_hdfs_ts.txt` file you created and 
added to HDFS in an earlier section:
 
 ``` sql
-INSERT INTO table_name ...;
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, 
num_orders int, total_sales float8)
+            LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple')
 
+          FORMAT 'TEXT' (delimiter=E',');
+gpadmin=# SELECT * FROM pxf_hdfs_textsimple;          
 ```
 
-To read the data in the files or to write based on the existing format, use 
`FORMAT`, `PROFILE`, or one of the classes.
+``` pre
+   location    | month | num_orders | total_sales 
+---------------+-------+------------+-------------
+ Prague        | Jan   |        101 |     4875.33
+ Rome          | Mar   |         87 |     1557.39
+ Bangalore     | May   |        317 |     8936.99
+ Beijing       | Jul   |        411 |    11600.67
+(4 rows)
+```
 
-This topic describes the following:
+Create a second external table from `pxf_hdfs_ts.txt`, this time using the 
`CSV` `FORMAT`:
 
--   FORMAT clause
--   Profile
--   Accessor
--   Resolver
--   Avro
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month 
text, num_orders int, total_sales float8)
+            LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple')
 
+          FORMAT 'CSV';
+gpadmin=# SELECT * FROM pxf_hdfs_textsimple_csv;          
+```
 
-**Note:** For more details about the API and classes, see [PXF External Tables 
and API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
+When specifying `FORMAT 'CSV'` for a comma-separated value file, no 
`delimiter` formatter option is required, as comma is the default delimiter.
 
-### <a id="formatclause"></a>FORMAT clause
+### <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile
 
-Use one of the following formats to read data with any PXF connector:
+Use the `HdfsTextMulti` profile when reading plain text files with delimited single- or multi-line records that include embedded (quoted) linefeed characters.
 
--   `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
--   `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
--   `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and 
binary formats. Must always be used with the built-in formatter 
'`pxfwritable_import`' (for read) or '`pxfwritable_export`' (for write).
+The following SQL call uses the PXF `HdfsTextMulti` profile to create a 
queryable HAWQ external table from the `pxf_hdfs_tm.txt` file you created and 
added to HDFS in an earlier section:
 
-**Note:** When creating PXF external tables, you cannot use the `HEADER` 
option in your `FORMAT` specification.
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, 
year int)
+            LOCATION 
('pxf://sandbox.hortonworks.com:51200/data/pxf_examples/pxf_hdfs_tm.txt?PROFILE=HdfsTextMulti')
 
+          FORMAT 'CSV' (delimiter=E':');
+gpadmin=# SELECT * FROM pxf_hdfs_textmulti;
+```
 
-### <a id="topic_ab2_sxy_bv"></a>Profile
+``` pre
+         address          | month | year 
+--------------------------+-------+------
+ 4627 Star Rd.            | Sept  | 2017
+ San Francisco, CA  94107           
+ 113 Moon St.             | Jan   | 2018
+ San Diego, CA  92093               
+ 51 Belt Ct.              | Dec   | 2016
+ Denver, CO  90123                  
+ 93114 Radial Rd.         | Jul   | 2017
+ Chicago, IL  60605                 
+ 7301 Brookview Ave.      | Dec   | 2018
+ Columbus, OH  43213                
+(5 rows)
+```
 
-For plain or comma-separated text files in HDFS use either the 
`HdfsTextSimple` or `HdfsTextMulti` Profile, or the classname 
org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro` profile 
for Avro files. See [Using Profiles to Read and Write 
Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information.
+### <a id="profile_hdfsseqwritable"></a>SequenceWritable Profile 
 
-**Note:** For read tables, you must include a Profile or a Fragmenter in the 
table definition.
+Use the `SequenceWritable` profile when reading SequenceFile format files. 
Files of this type consist of binary key/value pairs. Sequence files are a 
common data transfer format between MapReduce jobs. 
 
-### <a id="accessor"></a>Accessor
+The `SequenceWritable` profile supports the following \<custom-options\> or 
\<formatting-properties\>:
 
-The choice of an Accessor depends on the HDFS data file type. 
+| Keyword  | Value Description |
+|-------|-------------------------------------|
+| COMPRESSION_CODEC    | The Java class name of the compression codec. |
+| COMPRESSION_TYPE    | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
+| \<path-to-hdfs-file\>    | The path to the file in the HDFS data store. |
 
-**Note:** You must include either a Profile or an Accessor in the table 
definition.
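As a hypothetical sketch only, the options above can be combined in a table definition as follows. The file name, column list, and compression setting below are illustrative and do not correspond to an example file created in this topic; depending on how the sequence file was written, the profile may also require additional serialization options not covered here:

``` sql
gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_seq_example(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_seq_example?PROFILE=SequenceWritable&COMPRESSION_TYPE=BLOCK')
          FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```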
+
+### <a id="profile_hdfsavro"></a>Avro Profile
+
+Avro files store both the data and its schema metadata. Avro also supports specifying an independent schema to use when reading the file. 
+
+#### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping
+
+To represent Avro data in HAWQ, map data values that use a primitive data type 
to HAWQ columns of the same type. 
+
+Avro supports complex data types including arrays, maps, records, 
enumerations, and fixed types. Map top-level fields of these complex data types 
to the HAWQ `TEXT` type. While HAWQ does not natively support these types, you 
can create HAWQ functions or application code to extract or further process 
subcomponents of these complex data types.
+
+The following table summarizes external mapping rules for Avro data.
+
+<caption><span class="tablecap">Table 2. Avro Data Type 
Mapping</span></caption>
+
+<a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>
+
+| Avro Data Type                                                    | PXF Type 
                                                                                
                                                                                
                           |
+|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Primitive type (int, double, float, long, string, bytes, boolean) | Use the 
corresponding HAWQ built-in data type; see [Data 
Types](../reference/HAWQDataTypes.html). |
+| Complex type: Array, Map, Record, or Enum                         | TEXT, 
with delimiters inserted between collection items, mapped key-value pairs, and 
record data.                                                                    
                       |
+| Complex type: Fixed                                               | BYTEA    
                                                                                
                                                                                
                           |
+| Union                                                             | Follows 
the above conventions for primitive or complex data types, depending on the 
union; supports Null values.                                                    
                 |
+
+#### <a id="profile_hdfsavroptipns"></a>Avro-Specific Formatting Options
+
+For complex types, the PXF Avro profile inserts default delimiters between 
collection items and values. You can use non-default delimiter characters by 
identifying values for specific Avro custom options in the `CREATE EXTERNAL 
TABLE` call. 
+
+The Avro profile supports the following custom options:
+
+<caption><span class="tablecap">Table N. Avro Formatting 
Options</span></caption>
+
+| Option Name   | Description |
+|---------------|--------------------|
+| COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma `,` character. |
+| MAPKEY_DELIM | The delimiter character(s) to place between the key and value 
of a map entry when PXF maps an Avro complex data type to a text column. The 
default is a colon `:` character. |
+| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
+| SCHEMA-DATA | The data schema file used to create and read the HDFS file. For an Avro file, this is typically an `.avsc` schema file. This option has no default value. |
+| THREAD-SAFE | Determines if the table query can run in multi-threaded mode. Allowed values are `TRUE` and `FALSE`; the default value is `TRUE`, in which case requests can run in multi-threaded mode. When set to `FALSE`, requests are handled in a single thread. |
+
+#### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas
+
+Avro schemas are defined using JSON, and are composed of the same primitive and complex types identified in the data type mapping section above. Avro schema files typically have a `.avsc` suffix.
+
+Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type. Each field type is itself another schema object.
+
+The examples in this section operate on Avro data with the following record schema:
+
+- id - long
+- username - string
+- followers - array of string
+- fmap - map of long
+- address - record comprised of street number (int), street name (string), and 
city (string)
+- relationship - enumerated type
+
+Create an Avro schema file to represent the above schema:
+
+``` shell
+$ vi /tmp/avro_schema.avsc
+```
+
+Copy and paste the following text into `avro_schema.avsc`:
+
+``` json
+{
+  "type" : "record",
+  "name" : "example_schema",
+  "namespace" : "com.example",
+  "fields" : [ {
+    "name" : "id",
+    "type" : "long",
+    "doc" : "Id of the user account"
+  }, {
+    "name" : "username",
+    "type" : "string",
+    "doc" : "Name of the user account"
+  }, {
+    "name" : "followers",
+    "type" : {"type": "array", "items": "string"},
+    "doc" : "Users followers"
+  }, {
+    "name": "fmap",
+    "type": {"type": "map", "values": "long"}
+  }, {
+    "name": "relationship",
+    "type": {
+        "type": "enum",
+        "name": "relationshipEnum",
+        "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
+    }
+  }, {
+    "name": "address",
+    "type": {
+        "type": "record",
+        "name": "addressRecord",
+        "fields": [
+            {"name":"number", "type":"int"},
+            {"name":"street", "type":"string"},
+            {"name":"city", "type":"string"}]
+    }
+  } ],
+  "doc:" : "A basic schema for storing messages"
+}
+```
+
+An Avro schema, together with its data, is fully self-describing.  
+
+#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+
+Create a text file named `pxf_hdfs_avro.txt`:
+
+``` shell
+$ vi /tmp/pxf_hdfs_avro.txt
+```
+
+Enter the following data into `pxf_hdfs_avro.txt`:
+
+``` pre
+{"id":1, "username":"john","followers":["kate", "santosh"], "relationship": 
"FRIEND", "fmap": {"kate":10,"santosh":4}, "address":{"number":1, 
"street":"renaissance drive", "city":"san jose"}}
+    
+{"id":2, "username":"jim","followers":["john", "pam"], "relationship": 
"COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer 
creek", "city":"palo alto"}}
+```
+
+The sample data uses a comma `,` to separate top-level fields and a colon `:` to separate map keys from values and record field names from values.
+
+Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and from the command line. This example uses the [Java Avro tools](http://avro.apache.org/releases.html) and assumes the jar file resides in the current directory:
+
+``` shell
+$ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file 
/tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
+```
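Optionally, you can verify the conversion by decoding the binary file back to JSON with the Avro tools `tojson` command:

``` shell
$ java -jar ./avro-tools-1.8.1.jar tojson /tmp/pxf_hdfs_avro.avro
```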
+
+The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. 
Copy this file to HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
+```
+
+Create a queryable external table from this Avro file:
+
+-  Map the top-level primitive fields, `id` (type long) and `username` (type 
string), to their equivalent HAWQ types (bigint and text). 
+-  Map the remaining complex fields to type text.
+-  Explicitly set the record, map, and collection delimiters using the Avro 
profile custom options:
+
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, 
followers text, fmap text, relationship text, address text)
+            LOCATION 
('pxf://sandbox.hortonworks.com:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
+          FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+A simple query of the external table shows the components of the complex type 
data separated with delimiters:
+
+``` sql
+gpadmin=# SELECT * FROM pxf_hdfs_avro;
+```
+
+``` pre
+ id | username |   followers    |        fmap         | relationship |         
             address                      
+----+----------+----------------+--------------------+--------------+---------------------------------------------------
+  1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | 
{number:1,street:renaissance drive,city:san jose}
+  2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | 
{number:9,street:deer creek,city:palo alto}
+(2 rows)
+```
+
+Process the delimited components in the text columns as necessary for your 
application. For example, the following command uses the `string_to_array` 
function to convert entries in the `followers` field to a text array column in 
a new view. The view is then queried to filter rows based on whether a 
particular follower appears in the array:
+
+``` sql
+gpadmin=# CREATE VIEW followers_view AS 
+  SELECT username, address, string_to_array(substring(followers FROM 2 FOR 
(char_length(followers) - 2)), ',')::text[] 
+    AS followers 
+  FROM pxf_hdfs_avro;
+
+gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
+```
+
+``` pre
+ username |                   address                   
+----------+---------------------------------------------
+ jim      | {number:9,street:deer creek,city:palo alto}
+```
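Client applications can apply the same delimiter-based parsing outside the database. The following Python sketch uses hypothetical helper names and assumes the `,` and `:` delimiters set in the example above, with no embedded delimiter characters in the values:

``` python
def parse_pxf_record(text, collection_delim=",", recordkey_delim=":"):
    """Parse a PXF text-mapped Avro record field, e.g.
    '{number:9,street:deer creek,city:palo alto}', into a dict.
    Naive split: breaks if values contain the delimiter characters."""
    inner = text.strip().strip("{}")
    fields = {}
    for item in inner.split(collection_delim):
        key, _, value = item.partition(recordkey_delim)
        fields[key] = value
    return fields

def parse_pxf_array(text, collection_delim=","):
    """Parse a PXF text-mapped Avro array field, e.g. '[kate,santosh]',
    into a list of strings."""
    return text.strip().strip("[]").split(collection_delim)

address = parse_pxf_record("{number:9,street:deer creek,city:palo alto}")
# address -> {'number': '9', 'street': 'deer creek', 'city': 'palo alto'}
followers = parse_pxf_array("[kate,santosh]")
# followers -> ['kate', 'santosh']
```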
+
 
-<table>
-<colgroup>
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>File Type</th>
-<th>Accessor</th>
-<th>FORMAT clause</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Plain Text delimited</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
-<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
-<td> Read + Write
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="even">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>LineBreakAccessor is parallel and faster.</p>
-<p>Use if each logical data row is a physical data line.</p>
-<p>Read + Write </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="odd">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins. hdfs.QuotedLineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
-<p>Use if the data includes embedded (quoted) linefeed characters.</p>
-<p>Read Only </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="even">
-<td>SequenceFile</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.SequenceFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read + Write (use formatter='pxfwritable_export' for write)</td>
-</tr>
-<tr class="odd">
-<td>AvroFile</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.AvroFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read Only</td>
-</tr>
-</tbody>
-</table>
 
 ### <a id="resolver"></a>Resolver
 
@@ -274,17 +543,19 @@ The class file must follow the following requirements:
 </tbody>
 </table>
 
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data on a 
High Availability HDFS Cluster
+## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing HDFS Data 
in a High Availability HDFS Cluster
 
-To access data on a High Availability HDFS cluster, change the authority in 
the URI in the LOCATION. Use *HA\_nameservice* instead of 
*name\_node\_host:51200*.
+To access data in a High Availability HDFS cluster, change the \<host\> provided in the URI LOCATION clause. Use the *HA\_nameservice* rather than *name\_node\_host:51200*.
 
 ``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
-LOCATION ('pxf://<HA nameservice>/<path to file or 
directory>?Profile=profile[&<additional options>=<value>]')
-FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
+CREATE EXTERNAL TABLE <table_name> 
+    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>
+    
?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
+FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
 ```
 
-The opposite is true when a highly available HDFS cluster is reverted to a 
single namenode configuration. In that case, any table definition that has the 
nameservice specified should use the &lt;NN host&gt;:&lt;NN rest port&gt; 
syntax. 
+The opposite is true when a highly available HDFS cluster is reverted to a 
single namenode configuration. In that case, any table definition that has the 
\<HA-nameservice\> specified should use the \<host\>[:\<port\>] syntax. 
 
 ## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with 
Key-Value File Formats
 
@@ -356,152 +627,5 @@ CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday 
text, weight float)
 SELECT * FROM babies_1940_2; 
 ```
 
-## <a id="topic_oy3_qwm_ss"></a>Working with Avro Files
-
-Avro files combine their data with a schema, and can support complex data 
types such as arrays, maps, records, enumerations, and fixed types. When you 
create a PXF external table to represent Avro data, map top-level fields in the 
schema that use a primitive data type to HAWQ columns of the same type. Map 
top-level fields that use a complex data type to a TEXT column in the external 
table. The PXF Avro profile automatically separates components of a complex 
type by inserting delimiters in the text column. You can then use functions or 
application code to further process components of the complex data.
-
-The following table summarizes external table mapping rules for Avro data.
-
-<caption><span class="tablecap">Table 2. Avro Data Type 
Mapping</span></caption>
-
-<a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>
-
-| Avro Data Type                                                    | PXF Type 
                                                                                
                                                                                
                           |
-|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Primitive type (int, double, float, long, string, bytes, boolean) | 
Corresponding HAWQ data type. See [Data 
Types](../reference/HAWQDataTypes.html). |
-| Complex type: Array, Map, Record, or Enum                         | TEXT, 
with default delimiters inserted between collection items, mapped key-value 
pairs, and record data.                                                         
                                  |
-| Complex type: Fixed                                               | BYTEA    
                                                                                
                                                                                
                           |
-| Union                                                             | Follows 
the above conventions for primitive or complex data types, depending on the 
union. Null values are supported in Unions.                                     
                                |
-
-For complex types, the PXF Avro profile inserts default delimiters between 
collection items and values. You can use non-default delimiter characters by 
including the `COLLECTION_DELIM`, `MAPKEY_DELIM`, and/or `RECORDKEY_DELIM` 
optional parameters on the Avro profile. See [Additional PXF 
Options](#additionaloptions__table_skq_kpz_4p) for a description of the 
parameters.
-
-### <a id="topic_tr3_dpg_ts"></a>Example
-
-The following example uses the Avro schema shown in [Sample Avro 
Schema](#topic_tr3_dpg_ts__section_m2p_ztg_ts) and the associated data file 
shown in [Sample Avro Data (JSON)](#topic_tr3_dpg_ts__section_spk_15g_ts).
-
-#### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Sample Avro Schema
-
-``` json
-{
-  "type" : "record",
-  "name" : "example_schema",
-  "namespace" : "com.example",
-  "fields" : [ {
-    "name" : "id",
-    "type" : "long",
-    "doc" : "Id of the user account"
-  }, {
-    "name" : "username",
-    "type" : "string",
-    "doc" : "Name of the user account"
-  }, {
-    "name" : "followers",
-    "type" : {"type": "array", "items": "string"},
-    "doc" : "Users followers"
-  }, {
-    "name": "rank",
-    "type": ["null", "int"],
-    "default": null
-  }, {
-    "name": "fmap",
-    "type": {"type": "map", "values": "long"}
-  }, {
-    "name": "address",
-    "type": {
-        "type": "record",
-        "name": "addressRecord",
-        "fields": [
-            {"name":"number", "type":"int"},
-            {"name":"street", "type":"string"},
-            {"name":"city", "type":"string"}]
-    }
-  }, {
-   "name": "relationship",
-    "type": {
-        "type": "enum",
-        "name": "relationshipEnum",
-        "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
-    }
-  }, {
-    "name" : "md5",
-    "type": {
-        "type" : "fixed",
-        "name" : "md5Fixed",
-        "size" : 4
-    }
-  } ],
-  "doc:" : "A basic schema for storing messages"
-}
-```
-
-#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
-
-``` pre
-{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null, 
"relationship": "FRIEND", "fmap": {"kate":10,"santosh":4},
-"address":{"street":"renaissance drive", "number":1,"city":"san jose"}, 
"md5":\u3F00\u007A\u0073\u0074}
-
-{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, 
"relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, 
-"address":{"street":"deer creek", "number":9,"city":"palo alto"}, 
"md5":\u0010\u0021\u0003\u0004}
-```
-
-To map this Avro file to an external table, the top-level primitive fields 
("id" of type long and "username" of type string) are mapped to their 
equivalent HAWQ types (bigint and text). The remaining complex fields are 
mapped to text columns:
-
-``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex 
-  (id bigint, 
-  username text, 
-  followers text, 
-  rank int, 
-  fmap text, 
-  address text, 
-  relationship text,
-  md5 bytea) 
-LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-```
-
-The above command uses default delimiters for separating components of the 
complex types. This command is equivalent to the one above, but it explicitly 
sets the delimiters using the Avro profile parameters:
-
-``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex 
-  (id bigint, 
-  username text, 
-  followers text, 
-  rank int, 
-  fmap text, 
-  address text, 
-  relationship text,
-  md5 bytea) 
-LOCATION 
('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-```
-
-A simple query of the external table shows the components of the complex type 
data separated with delimiters:
-
-``` sql
-gpadmin=# select * from avro_complex;
-```
-
-``` pre
-id | username |  followers  |    rank     |  fmap   |    address  |  
relationship  |  md5
-1| john | [kate,santosh] |   | {kate:10,santosh:4} | 
{number:1,street:renaissance drive,city:san jose} | FRIEND | ?zst
-2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer 
creek,city:palo alto} | COLLEAGUE | \020!\003\004
-```
-
-You can process the delimited components in the text columns as necessary for 
your application. For example, the following command uses the `string_to_array` 
function to convert entries in the "followers" field to a text array column in 
a new view. The view is then queried to filter rows based on whether a 
particular follower appears in the array:
-
-``` sql
-gpadmin=# create view followers_view as 
-  select username, address, string_to_array(substring(followers from 2 for 
(char_length(followers) - 2)), ',')::text[] 
-    as followers 
-  from avro_complex;
-
-gpadmin=# select username, address 
-from followers_view 
-where john = ANY(followers);
-```
-
-``` pre
-username | address
-jim | {number:9,street:deer creek,city:palo alto}
-```
+## <a id="hdfs_advanced"></a>Advanced
+If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. Refer to [XX]() for more information.
\ No newline at end of file
