[GitHub] incubator-hawq-docs pull request #33: HAWQ-1107 - enhance PXF HDFS plugin do...

dyozie Tue, 25 Oct 2016 14:22:01 -0700

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85000415
  
    --- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
    @@ -2,388 +2,282 @@
     title: Accessing HDFS File Data
     ---
     
    -## <a id="installingthepxfhdfsplugin"></a>Prerequisites
    +HDFS is the primary distributed storage mechanism used by Apache Hadoop 
applications. The PXF HDFS plug-in reads file data stored in HDFS.  The plug-in 
supports plain delimited and comma-separated-value format text files.  The HDFS 
plug-in also supports the Avro binary format.
     
    -Before working with HDFS file data using HAWQ and PXF, you should perform 
the following operations:
    +This section describes how to use PXF to access HDFS data, including how 
to create and query an external table from files in the HDFS data store.
     
    --   Test PXF on HDFS before connecting to Hive or HBase.
    --   Ensure that all HDFS users have read permissions to HDFS services and 
that write permissions have been limited to specific users.
    +## <a id="hdfsplugin_prereq"></a>Prerequisites
     
    -## <a id="syntax1"></a>Syntax
    +Before working with HDFS file data using HAWQ and PXF, ensure that:
     
    -The syntax for creating an external HDFS file is as follows:
    +-   The HDFS plug-in is installed on all cluster nodes.
    +-   All HDFS users have read permissions to HDFS services and that write 
permissions have been restricted to specific users.
     
    -``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://host[:port]/path-to-data?<pxf 
parameters>[&custom-option=value...]')
    -      FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
    -```
    +## <a id="hdfsplugin_fileformats"></a>HDFS File Formats
     
    -where `<pxf parameters>` is:
    +The PXF HDFS plug-in supports reading the following file formats:
     
    -``` pre
    -   
FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +- Text File - comma-separated value (.csv) or delimited format plain text 
file
    +- Avro - JSON-defined, schema-based data serialization format
     
    -**Note:** Omit the `FRAGMENTER` parameter for `READABLE` external tables.
    +The PXF HDFS plug-in includes the following profiles to support the file 
formats listed above:
     
    -Use an SQL `SELECT` statement to read from an HDFS READABLE table:
    +- `HdfsTextSimple` - text files
    +- `HdfsTextMulti` - text files with embedded line feeds
    +- `Avro` - Avro files
     
    -``` sql
    -SELECT ... FROM table_name;
    -```
     
    -Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
    +## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
    +Hadoop includes command-line tools that interact directly with HDFS.  
These tools support typical file system operations including copying and 
listing files, changing file permissions, etc. 
     
    -``` sql
    -INSERT INTO table_name ...;
    -```
    +The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked 
with no options, `hdfs dfs` lists the file system options supported by the tool.
    +
    +`hdfs dfs` options used in this section are identified in the table below:
    +
    +| Option  | Description |
    +|-------|-------------------------------------|
    +| `-cat`    | Display file contents. |
    +| `-mkdir`    | Create directory in HDFS. |
    +| `-put`    | Copy file from local file system to HDFS. |
    +
    +### <a id="hdfsplugin_cmdline_create"></a>Create Data Files
    +
    +Perform the following steps to create data files used in subsequent 
exercises:
    +
    +1. Create an HDFS directory for PXF example data files:
    +
    +    ``` shell
    +     $ sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples
    +    ```
    +
    +2. Create a delimited plain text file:
    +
    +    ``` shell
    +    $ vi /tmp/pxf_hdfs_simple.txt
    +    ```
    +
    +3. Copy and paste the following data into `pxf_hdfs_simple.txt`:
    +
    +    ``` pre
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    ```
    +
    +    Notice the use of the comma `,` to separate the four data fields.
    +
    +4. Add the data file to HDFS:
    +
    +    ``` shell
    +    $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_simple.txt 
/data/pxf_examples/
    +    ```
    +
    +5. Display the contents of the `pxf_hdfs_simple.txt` file stored in HDFS:
    +
    +    ``` shell
    +    $ sudo -u hdfs hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
    +    ```
    +
    +6. Create a second delimited plain text file:
    +
    +    ``` shell
    +    $ vi /tmp/pxf_hdfs_multi.txt
    +    ```
     
    -To read the data in the files or to write based on the existing format, 
use `FORMAT`, `PROFILE`, or one of the classes.
    -
    -This topic describes the following:
    -
    --   FORMAT clause
    --   Profile
    --   Accessor
    --   Resolver
    --   Avro
    -
    -**Note:** For more details about the API and classes, see [PXF External 
Tables and 
API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
    -
    -### <a id="formatclause"></a>FORMAT clause
    -
    -Use one of the following formats to read data with any PXF connector:
    -
    --   `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
    --   `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
    --   `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and 
binary formats. Must always be used with the built-in formatter 
'`pxfwritable_import`' (for read) or '`pxfwritable_export`' (for write).
    -
    -**Note:** When creating PXF external tables, you cannot use the `HEADER` 
option in your `FORMAT` specification.
    -
    -### <a id="topic_ab2_sxy_bv"></a>Profile
    -
    -For plain or comma-separated text files in HDFS use either the 
`HdfsTextSimple` or `HdfsTextMulti` Profile, or the classname 
org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro` profile 
for Avro files. See [Using Profiles to Read and Write 
Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information.
    -
    -**Note:** For read tables, you must include a Profile or a Fragmenter in 
the table definition.
    -
    -### <a id="accessor"></a>Accessor
    -
    -The choice of an Accessor depends on the HDFS data file type.
    -
    -**Note:** You must include either a Profile or an Accessor in the table 
definition.
    -
    -<table>
    -<colgroup>
    -<col width="25%" />
    -<col width="25%" />
    -<col width="25%" />
    -<col width="25%" />
    -</colgroup>
    -<thead>
    -<tr class="header">
    -<th>File Type</th>
    -<th>Accessor</th>
    -<th>FORMAT clause</th>
    -<th>Comments</th>
    -</tr>
    -</thead>
    -<tbody>
    -<tr class="odd">
    -<td>Plain Text delimited</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</td>
    -<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
    -<td>Read + Write
    -<p>You cannot use the <code class="ph codeph">HEADER</code> 
option.</p></td>
    -</tr>
    -<tr class="even">
    -<td>Plain Text CSV</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</td>
    -<td>FORMAT 'CSV' (<em>format param list</em>)</td>
    -<td><p>LineBreakAccessor is parallel and faster.</p>
    -<p>Use if each logical data row is a physical data line.</p>
    -<p>Read + Write</p>
    -<p>You cannot use the <code class="ph codeph">HEADER</code> 
option.</p></td>
    -</tr>
    -<tr class="odd">
    -<td>Plain Text CSV</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor</td>
    -<td>FORMAT 'CSV' (<em>format param list</em>)</td>
    -<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
    -<p>Use if the data includes embedded (quoted) linefeed characters.</p>
    -<p>Read Only</p>
    -<p>You cannot use the <code class="ph codeph">HEADER</code> 
option.</p></td>
    -</tr>
    -<tr class="even">
    -<td>SequenceFile</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor</td>
    -<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
    -<td>Read + Write (use formatter='pxfwritable_export' for write)</td>
    -</tr>
    -<tr class="odd">
    -<td>AvroFile</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor</td>
    -<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
    -<td>Read Only</td>
    -</tr>
    -</tbody>
    -</table>
    -
    -### <a id="resolver"></a>Resolver
    -
    -Choose the Resolver format if data records are serialized in the HDFS 
file.
    -
    -**Note:** You must include a Profile or a Resolver in the table definition.
    -
    -<table>
    -<colgroup>
    -<col width="33%" />
    -<col width="33%" />
    -<col width="33%" />
    -</colgroup>
    -<thead>
    -<tr class="header">
    -<th>Record Serialization</th>
    -<th>Resolver</th>
    -<th>Comments</th>
    -</tr>
    -</thead>
    -<tbody>
    -<tr class="odd">
    -<td>Avro</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.AvroResolver</td>
    -<td><ul>
    -<li>Avro files include the record schema, Avro serialization can be used 
in other file types (e.g, Sequence File).</li>
    -<li>For Avro serialized records outside of an Avro file, include a schema 
file name (.avsc) in the url under the optional <code class="ph 
codeph">Schema-Data </code>option.</li>
    -<li>Deserialize Only (Read).</li>
    -</ul></td>
    -</tr>
    -<tr class="even">
    -<td>Java Writable</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.WritableResolver</td>
    -<td><ul>
    -<li>Include the name of the Java class that uses Writable serialization 
in the URL under the optional <code class="ph codeph">Schema-Data.</code></li>
    -<li>The class file must exist in the public stage directory (or in 
Hadoop's class path).</li>
    -<li>Deserialize and Serialize (Read + Write).</li>
    -<li>See <a href="#customizedwritableschemafileguidelines">Customized 
Writable Schema File Guidelines</a>.</li>
    -</ul></td>
    -</tr>
    -<tr class="odd">
    -<td>None (plain text)</td>
    -<td>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</td>
    -<td><ul>
    -<li>Does not serialize plain text records. The database parses plain 
records. Passes records as they are.</li>
    -<li>Deserialize and Serialize (Read + Write).</li>
    -</ul></td>
    -</tr>
    -</tbody>
    -</table>
    -
    -#### <a id="customizedwritableschemafileguidelines"></a>Schema File 
Guidelines for WritableResolver
    -
    -When using a WritableResolver, a schema file needs to be defined. The file 
needs to be a Java class file and must be on the class path of PXF.
    -
    -The class file must follow the following requirements:
    -
    -1.  Must implement org.apache.hadoop.io.Writable interface.
    -2.  WritableResolver uses reflection to recreate the schema and populate 
its fields (for both read and write). Then it uses the Writable interface 
functions to read/write. Therefore, fields must be public, to enable access to 
them. Private fields will be ignored.
    -3.  Fields are accessed and populated in the order in which they are 
declared in the class file.
    -4.  Supported field types:
    -    -   boolean
    -    -   byte array
    -    -   double
    -    -   float
    -    -   int
    -    -   long
    -    -   short
    -    -   string
    -
    -    Arrays of any of the above types are supported, but the constructor 
must define the array size so the reflection will work.
    -
    -### <a id="additionaloptions"></a>Additional Options
    -
    -<a id="additionaloptions__table_skq_kpz_4p"></a>
    -
    -<table>
    -<caption><span class="tablecap">Table 1. Additional PXF 
Options</span></caption>
    -<colgroup>
    -<col width="50%" />
    -<col width="50%" />
    -</colgroup>
    -<thead>
    -<tr class="header">
    -<th>Option Name</th>
    -<th>Description</th>
    -</tr>
    -</thead>
    -<tbody>
    -<tr class="odd">
    -<td>COLLECTION_DELIM</td>
    -<td>(Avro or Hive profiles only.) The delimiter character(s) to place 
between entries in a top-level array, map, or record field when PXF maps a Hive 
or Avro complex data type to a text column. The default is a &quot;,&quot; 
character.</td>
    -</tr>
    -<tr class="even">
    -<td>COMPRESSION_CODEC</td>
    -<td><ul>
    -<li>Useful for WRITABLE PXF tables.</li>
    -<li>Specifies the compression codec class name for compressing the written 
data. The class must implement the 
org.apache.hadoop.io.compress.CompressionCodec interface.</li>
    -<li>Some valid values are org.apache.hadoop.io.compress.DefaultCodec 
org.apache.hadoop.io.compress.GzipCodec 
org.apache.hadoop.io.compress.BZip2Codec.</li>
    -<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread 
and can be slow.</li>
    -<li>This option has no default value.</li>
    -<li>When the option is not defined, no compression will be done.</li>
    -</ul></td>
    -</tr>
    -<tr class="odd">
    -<td>COMPRESSION_TYPE</td>
    -<td><ul>
    -<li>Useful WRITABLE PXF tables with SequenceFileAccessor.</li>
    -<li>Ignored when COMPRESSION_CODEC is not defined.</li>
    -<li>Specifies the compression type for sequence file.</li>
    -<li>Valid options are:
    -<ul>
    -<li>RECORD - only the value part of each row is compressed.</li>
    -<li>BLOCK - both keys and values are collected in 'blocks' separately and 
compressed.</li>
    -</ul></li>
    -<li>Default value: RECORD.</li>
    -</ul></td>
    -</tr>
    -<tr class="even">
    -<td>MAPKEY_DELIM</td>
    -<td>(Avro or Hive profiles only.) The delimiter character(s) to place 
between the key and value of a map entry when PXF maps a Hive or Avro complex 
data type to a text colum. The default is a &quot;:&quot; character.</td>
    -</tr>
    -<tr class="odd">
    -<td>RECORDKEY_DELIM</td>
    -<td>(Avro profile only.) The delimiter character(s) to place between the 
field name and value of a record entry when PXF maps an Avro complex data type 
to a text colum. The default is a &quot;:&quot; character.</td>
    -</tr>
    -<tr class="even">
    -<td>SCHEMA-DATA</td>
    -<td>The data schema file used to create and read the HDFS file. For 
example, you could create an avsc (for Avro), or a Java class (for Writable 
Serialization) file. Make sure that you have added any JAR files containing the 
schema to <code class="ph codeph">pxf-public.classpath</code>.
    -<p>This option has no default value.</p></td>
    -</tr>
    -<tr class="odd">
    -<td>THREAD-SAFE</td>
    -<td>Determines if the table query can run in multithread mode or not. When 
set to FALSE, requests will be handled in a single thread.
    -<p>Should be set when a plug-in or other elements that are not thread safe 
are used (e.g. compression codec).</p>
    -<p>Allowed values: TRUE, FALSE. Default value is TRUE - requests can run 
in multithread mode.</p></td>
    -</tr>
    -<tr class="even">
    -<td>&lt;custom&gt;</td>
    -<td>Any option added to the pxf URI string will be accepted and passed, 
along with its value, to the Fragmenter, Accessor, and Resolver 
implementations.</td>
    -</tr>
    -</tbody>
    -</table>
    -
    -## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data 
on a High Availability HDFS Cluster
    -
    -To access data on a High Availability HDFS cluster, change the authority 
in the URI in the LOCATION. Use *HA\_nameservice* instead of 
*name\_node\_host:51200*.
    +7. Copy/paste the following data into `pxf_hdfs_multi.txt`:
    +
    +    ``` pre
    +    "4627 Star Rd.
    +    San Francisco, CA  94107":Sept:2017
    +    "113 Moon St.
    +    San Diego, CA  92093":Jan:2018
    +    "51 Belt Ct.
    +    Denver, CO  90123":Dec:2016
    +    "93114 Radial Rd.
    +    Chicago, IL  60605":Jul:2017
    +    "7301 Brookview Ave.
    +    Columbus, OH  43213":Dec:2018
    +    ```
    +
    +    Notice the use of the colon `:` to separate the three fields. Also 
notice the quotes around the first (address) field. This field includes an 
embedded line feed.
    +
    +8. Add the data file to HDFS:
    +
    +    ``` shell
    +    $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_multi.txt 
/data/pxf_examples/
    +    ```
    +
    +You will use these HDFS files in later sections.
    +
    +## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
    +The PXF HDFS plug-in supports the `HdfsTextSimple`, `HdfsTextMulti`, and 
`Avro` profiles.
    +
    +Use the following syntax to create a HAWQ external table representing HDFS 
data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
    -LOCATION ('pxf://<HA nameservice>/<path to file or 
directory>?Profile=profile[&<additional options>=<value>]')
    -FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
    +CREATE EXTERNAL TABLE <table_name> 
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
    +    
?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro[&<custom-option>=<value>[...]]')
    +FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
     ```
     
    -The opposite is true when a highly available HDFS cluster is reverted to a 
single namenode configuration. In that case, any table definition that has the 
nameservice specified should use the &lt;NN host&gt;:&lt;NN rest port&gt; 
syntax.
    +HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL 
TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the 
table below.
    +
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:\<port\>]    | The HDFS NameNode and port. |
    +| \<path-to-hdfs-file\>    | The path to the file in the HDFS data store. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values 
`HdfsTextSimple`, `HdfsTextMulti`, or `Avro`. |
    +| \<custom-option\>  | \<custom-option\> is profile-specific. 
Profile-specific options are discussed in the relevant profile topic later in 
this section.|
    +| FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile 
when \<path-to-hdfs-file\> references a plain text delimited file.  |
    +| FORMAT 'CSV' | Use '`CSV`' `FORMAT` with `HdfsTextSimple` and 
`HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a 
comma-separated value file.  |
    +| FORMAT 'CUSTOM' | Use the`CUSTOM` `FORMAT` with  the `Avro` profile. The 
`Avro` '`CUSTOM`' `FORMAT` supports only the built-in 
`(formatter='pxfwritable_import')` \<formatting-property\> |
    + \<formatting-properties\>    | \<formatting-properties\> are 
profile-specific. Profile-specific formatting options are discussed in the 
relevant profile topic later in this section. |
    +
    +*Note*: When creating PXF external tables, you cannot use the `HEADER` 
option in your `FORMAT` specification.
     
    -## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with 
Key-Value File Formats
    +## <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
     
    -For sequence file and other file formats that store rows in a key-value 
format, the key value can be accessed through HAWQ by using the saved keyword 
'`recordkey`' as a field name.
    +Use the `HdfsTextSimple` profile when reading plain text delimited or .csv 
files where each row is a single record.
     
    -The field type must correspond to the key type, much as the other fields 
must match the HDFS data.
    +\<formatting-properties\> supported by the `HdfsTextSimple` profile 
include:
     
    -WritableResolver supports read and write of recordkey, which can be of the 
following Writable Hadoop types:
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| delimiter    | The delimiter character in the file. Default value is a 
comma `,`.|
     
    --   BooleanWritable
    --   ByteWritable
    --   DoubleWritable
    --   FloatWritable
    --   IntWritable
    --   LongWritable
    --   Text
     
    -If the `recordkey` field is not defined, the key is ignored in read, and a 
default value (segment id as LongWritable) is written in write.
    +The following SQL call uses the PXF `HdfsTextSimple` profile to create a 
queryable HAWQ external table from the `pxf_hdfs_simple.txt` file you created 
and added to HDFS in an earlier section:
    --- End diff --
    
    Let's not assume they've reached this topic by reading/completing the 
previous sections.  I think it would be better to roll the sample file creation 
steps directly into this procedure (making them optional).  That way readers 
can see exactly the correlation between the file contents, external table 
statement, and query output.  It's best to call this out in a new "Example" 
subsection with numbered steps.
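
    As a rough sketch only, the optional file-creation steps of such an "Example" subsection might look like the block below. It writes the same four rows shown in the diff to `/tmp/pxf_hdfs_simple.txt` and then copies the file to `/data/pxf_examples/`. Using `printf` in place of the `vi` step and guarding the HDFS commands with `command -v hdfs` are my own assumptions for illustration; neither is part of the PR.

``` shell
# (Optional) Create the sample data file locally.
# Four comma-separated fields per row, matching the rows in the diff.
printf '%s\n' \
    'Prague,Jan,101,4875.33' \
    'Rome,Mar,87,1557.39' \
    'Bangalore,May,317,8936.99' \
    'Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt

# Copy the file into HDFS, but only on a host that has the hdfs client.
# (The guard is an assumption so the snippet is harmless elsewhere.)
if command -v hdfs >/dev/null 2>&1; then
    sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples
    sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
    # Show the stored file so readers can compare it with the query output.
    sudo -u hdfs hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
fi
```

    Keeping the file contents directly above the table definition and query is what lets a reader line up each column with each field.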



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
