[
https://issues.apache.org/jira/browse/HAWQ-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612279#comment-15612279
]
ASF GitHub Bot commented on HAWQ-1107:
--------------------------------------
Github user kavinderd commented on a diff in the pull request:
https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85358483
--- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
@@ -2,506 +2,449 @@
title: Accessing HDFS File Data
---
-## <a id="installingthepxfhdfsplugin"></a>Prerequisites
+HDFS is the primary distributed storage mechanism used by Apache Hadoop
applications. The PXF HDFS plug-in reads file data stored in HDFS. The plug-in
supports plain delimited and comma-separated-value format text files. The HDFS
plug-in also supports the Avro binary format.
-Before working with HDFS file data using HAWQ and PXF, you should perform
the following operations:
+This section describes how to use PXF to access HDFS data, including how
to create and query an external table from files in the HDFS data store.
-- Test PXF on HDFS before connecting to Hive or HBase.
-- Ensure that all HDFS users have read permissions to HDFS services and
that write permissions have been limited to specific users.
+## <a id="hdfsplugin_prereq"></a>Prerequisites
-## <a id="syntax1"></a>Syntax
+Before working with HDFS file data using HAWQ and PXF, ensure that:
-The syntax for creating an external HDFS file is as follows:
+- The HDFS plug-in is installed on all cluster nodes. See [Installing
PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
+- All HDFS users have read permissions to HDFS services, and write
permissions are restricted to specific users.
-``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
- ( column_name data_type [, ...] | LIKE other_table )
-LOCATION ('pxf://host[:port]/path-to-data?<pxf
parameters>[&custom-option=value...]')
- FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
-```
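As a concrete sketch of this syntax (the host name, port, file path, and
columns here are hypothetical), a readable external table over a delimited
text file might look like:

``` sql
-- Hypothetical readable table over a delimited text file in HDFS;
-- namenode:51200 and the file path are placeholder values.
CREATE EXTERNAL TABLE sample_hdfs_text (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/sample.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter = E',');
```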
+## <a id="hdfsplugin_fileformats"></a>HDFS File Formats
-where `<pxf parameters>` is:
+The PXF HDFS plug-in supports reading the following file formats:
-``` pre
-[FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
- | PROFILE=profile-name
-```
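For illustration (host, port, and path are placeholders), the class-name
form of these parameters, equivalent to the `HdfsTextSimple` profile, might
read:

``` sql
-- Hypothetical table spelling out the plug-in classes rather than a
-- profile; class names are taken from the Accessor and Resolver tables
-- in this topic.
CREATE EXTERNAL TABLE sample_hdfs_classes (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/sample.txt?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.StringPassResolver')
FORMAT 'TEXT' (delimiter = E',');
```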
+- Text File - comma-separated value (.csv) or delimited format plain text
file
+- Avro - JSON-defined, schema-based data serialization format
-**Note:** Omit the `FRAGMENTER` parameter for `WRITABLE` external tables.
+The PXF HDFS plug-in includes the following profiles to support the file
formats listed above:
-Use an SQL `SELECT` statement to read from an HDFS READABLE table:
+- `HdfsTextSimple` - text files
+- `HdfsTextMulti` - text files with embedded line feeds
+- `Avro` - Avro files
-``` sql
-SELECT ... FROM table_name;
+If you find that the pre-defined PXF HDFS profiles do not meet your needs,
you may choose to create a custom HDFS profile from the existing HDFS
serialization and deserialization classes. Refer to [Adding and Updating
Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on
creating a custom profile.
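As an example of choosing among these profiles (the file path and columns
are assumed for illustration), a text file containing quoted, embedded line
feeds could be read with `HdfsTextMulti`:

``` sql
-- Hypothetical table over multi-line text data; namenode:51200 and the
-- path are placeholders.
CREATE EXTERNAL TABLE sample_multiline (address text, comment text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/multiline.csv?PROFILE=HdfsTextMulti')
FORMAT 'CSV';
```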
+
+## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
+Hadoop includes command-line tools that interact directly with HDFS.
These tools support typical file system operations including copying and
listing files, changing file permissions, and so forth.
+
+The HDFS file system command syntax is `hdfs dfs <options> [<file>]`.
Invoked with no options, `hdfs dfs` lists the file system options supported by
the tool.
+
+`hdfs dfs` options used in this topic are:
+
+| Option | Description |
+|-------|-------------------------------------|
+| `-cat` | Display file contents. |
+| `-mkdir` | Create directory in HDFS. |
+| `-put` | Copy file from local file system to HDFS. |
+
+Examples:
+
+Create a directory in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -mkdir -p /data/exampledir
```
-Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
+Copy a text file to HDFS:
-``` sql
-INSERT INTO table_name ...;
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/example.txt /data/exampledir/
```
-To read data from or write data to HDFS files, specify the appropriate
`FORMAT` clause and either a `PROFILE` or explicit Fragmenter, Accessor, and
Resolver classes.
-
-This topic describes the following:
-
-- FORMAT clause
-- Profile
-- Accessor
-- Resolver
-- Avro
-
-**Note:** For more details about the API and classes, see [PXF External
Tables and
API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
-
-### <a id="formatclause"></a>FORMAT clause
-
-Use one of the following formats to read data with any PXF connector:
-
-- `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
-- `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
-- `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and
binary formats. Must always be used with the built-in formatter
'`pxfwritable_import`' (for read) or '`pxfwritable_export`' (for write).
-
-**Note:** When creating PXF external tables, you cannot use the `HEADER`
option in your `FORMAT` specification.
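Tying these formats together (host, port, path, and columns are assumptions
for illustration), an Avro file read with `FORMAT 'CUSTOM'` might be defined
as:

``` sql
-- Hypothetical Avro read table; FORMAT 'CUSTOM' requires the built-in
-- pxfwritable_import formatter for reads.
CREATE EXTERNAL TABLE sample_avro (id int, username text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/sample.avro?PROFILE=Avro')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```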
-
-### <a id="topic_ab2_sxy_bv"></a>Profile
-
-For plain or comma-separated text files in HDFS, use either the
`HdfsTextSimple` or `HdfsTextMulti` profile, or the class name
org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro` profile
for Avro files. See [Using Profiles to Read and Write
Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information.
-
-**Note:** For read tables, you must include a Profile or a Fragmenter in
the table definition.
-
-### <a id="accessor"></a>Accessor
-
-The choice of an Accessor depends on the HDFS data file type.
-
-**Note:** You must include either a Profile or an Accessor in the table
definition.
-
-<table>
-<colgroup>
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>File Type</th>
-<th>Accessor</th>
-<th>FORMAT clause</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Plain Text delimited</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</td>
-<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
-<td> Read + Write
-<p>You cannot use the <code class="ph codeph">HEADER</code>
option.</p></td>
-</tr>
-<tr class="even">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>)</td>
-<td><p>LineBreakAccessor is parallel and faster than
QuotedLineBreakAccessor.</p>
-<p>Use if each logical data row is a physical data line.</p>
-<p>Read + Write </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code>
option.</p></td>
-</tr>
-<tr class="odd">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
-<p>Use if the data includes embedded (quoted) linefeed characters.</p>
-<p>Read Only </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code>
option.</p></td>
-</tr>
-<tr class="even">
-<td>SequenceFile</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read + Write (use formatter='pxfwritable_export' for write)</td>
-</tr>
-<tr class="odd">
-<td>AvroFile</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read Only</td>
-</tr>
-</tbody>
-</table>
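As a sketch based on the SequenceFile row above (the table name, path, and
schema class are hypothetical), a writable table might combine
SequenceFileAccessor with WritableResolver:

``` sql
-- Hypothetical writable SequenceFile table; per the note above, the
-- FRAGMENTER parameter is omitted for WRITABLE tables. MyWritableClass
-- is a placeholder for a Java Writable schema class on the PXF class path.
CREATE WRITABLE EXTERNAL TABLE sample_seq_write (id int, total float8)
LOCATION ('pxf://namenode:51200/data/pxf_examples/seqdata?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver&SCHEMA-DATA=MyWritableClass')
FORMAT 'CUSTOM' (formatter='pxfwritable_export');
```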
-
-### <a id="resolver"></a>Resolver
-
-Choose a Resolver based on how data records are serialized in the HDFS
file.
-
-**Note:** You must include a Profile or a Resolver in the table definition.
-
-<table>
-<colgroup>
-<col width="33%" />
-<col width="33%" />
-<col width="33%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Record Serialization</th>
-<th>Resolver</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Avro</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.AvroResolver</td>
-<td><ul>
-<li>Avro files include the record schema; Avro serialization can also be
used in other file types (e.g., SequenceFile).</li>
-<li>For Avro-serialized records outside of an Avro file, include a schema
file name (.avsc) in the URL under the optional <code class="ph
codeph">Schema-Data</code> option.</li>
-<li>Deserialize Only (Read).</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>Java Writable</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.WritableResolver</td>
-<td><ul>
-<li>Include the name of the Java class that uses Writable serialization in
the URL under the optional <code class="ph codeph">Schema-Data</code>
option.</li>
-<li>The class file must exist in the public stage directory (or in
Hadoop's class path).</li>
-<li>Deserialize and Serialize (Read + Write). </li>
-<li>See <a href="#customizedwritableschemafileguidelines">Customized
Writable Schema File Guidelines</a>.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>None (plain text)</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</td>
-<td><ul>
-<li>Does not serialize plain-text records; PXF passes records as-is and
the database parses them.</li>
-<li>Deserialize and Serialize (Read + Write).</li>
-</ul></td>
-</tr>
-</tbody>
-</table>
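For illustration of the Avro row above (all paths are placeholders),
reading Avro-serialized records from a sequence file with an external .avsc
schema might look like:

``` sql
-- Hypothetical readable table for Avro records stored outside an Avro
-- container file; the .avsc schema file name is passed via Schema-Data.
CREATE EXTERNAL TABLE sample_avro_seq (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/avroseq?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.AvroResolver&SCHEMA-DATA=sample.avsc')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```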
-
-#### <a id="customizedwritableschemafileguidelines"></a>Schema File
Guidelines for WritableResolver
-
-When using a WritableResolver, you must define a schema file. The file
must be a Java class file and must be on the PXF class path.
-
-The class must meet the following requirements:
-
-1. Must implement the org.apache.hadoop.io.Writable interface.
-2. WritableResolver uses reflection to recreate the schema and populate
its fields (for both read and write), then uses the Writable interface
functions to read and write. Fields must therefore be public; private fields
are ignored.
-3. Fields are accessed and populated in the order in which they are
declared in the class file.
-4. Supported field types:
- - boolean
- - byte array
- - double
- - float
- - int
- - long
- - short
- - string
-
- Arrays of any of the above types are supported, but the constructor
must define the array size so that reflection works.
-
-### <a id="additionaloptions"></a>Additional Options
-
-<a id="additionaloptions__table_skq_kpz_4p"></a>
-
-<table>
-<caption><span class="tablecap">Table 1. Additional PXF
Options</span></caption>
-<colgroup>
-<col width="50%" />
-<col width="50%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Option Name</th>
-<th>Description</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>COLLECTION_DELIM</td>
-<td>(Avro or Hive profiles only.) The delimiter character(s) to place
between entries in a top-level array, map, or record field when PXF maps a Hive
or Avro complex data type to a text column. The default is a ","
character.</td>
-</tr>
-<tr class="even">
-<td>COMPRESSION_CODEC</td>
-<td><ul>
-<li>Useful for WRITABLE PXF tables.</li>
-<li>Specifies the compression codec class name for compressing the written
data. The class must implement the
org.apache.hadoop.io.compress.CompressionCodec interface.</li>
-<li>Some valid values are org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec, and
org.apache.hadoop.io.compress.BZip2Codec.</li>
-<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread
and can be slow.</li>
-<li>This option has no default value. </li>
-<li>When the option is not defined, no compression will be done.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>COMPRESSION_TYPE</td>
-<td><ul>
-<li>Useful for WRITABLE PXF tables that use SequenceFileAccessor.</li>
-<li>Ignored when COMPRESSION_CODEC is not defined.</li>
-<li>Specifies the compression type for sequence file.</li>
-<li>Valid options are:
-<ul>
-<li>RECORD - only the value part of each row is compressed.</li>
-<li>BLOCK - both keys and values are collected in 'blocks' separately and
compressed.</li>
-</ul></li>
-<li>Default value: RECORD.</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>MAPKEY_DELIM</td>
-<td>(Avro or Hive profiles only.) The delimiter character(s) to place
between the key and value of a map entry when PXF maps a Hive or Avro complex
data type to a text column. The default is a ":" character.</td>
-</tr>
-<tr class="odd">
-<td>RECORDKEY_DELIM</td>
-<td>(Avro profile only.) The delimiter character(s) to place between the
field name and value of a record entry when PXF maps an Avro complex data type
to a text column. The default is a ":" character.</td>
-</tr>
-<tr class="even">
-<td>SCHEMA-DATA</td>
-<td>The data schema file used to create and read the HDFS file. For
example, an .avsc file (for Avro) or a Java class (for Writable
serialization). Make sure that you have added any JAR files containing the
schema to <code class="ph codeph">pxf-public.classpath</code>.
-<p>This option has no default value.</p></td>
-</tr>
-<tr class="odd">
-<td>THREAD-SAFE</td>
-<td>Determines whether the table query can run in multi-threaded mode.
When set to FALSE, requests are handled in a single thread.
-<p>Set to FALSE when the query uses a plug-in or other element that is not
thread-safe (e.g., a compression codec).</p>
-<p>Allowed values: TRUE, FALSE. The default value is TRUE; requests can
run in multi-threaded mode.</p></td>
-</tr>
-<tr class="even">
-<td> <custom></td>
-<td>Any option added to the pxf URI string will be accepted and passed,
along with its value, to the Fragmenter, Accessor, and Resolver
implementations.</td>
-</tr>
-</tbody>
-</table>
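As one example of these options in use (host, port, and path are assumed),
a writable text table could enable Gzip compression:

``` sql
-- Hypothetical writable table compressing its output with GzipCodec;
-- COMPRESSION_CODEC names a class implementing CompressionCodec.
CREATE WRITABLE EXTERNAL TABLE sample_compressed (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/compressed?PROFILE=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (delimiter = E',');
```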
-
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data
on a High Availability HDFS Cluster
-
-To access data on a High Availability HDFS cluster, change the authority
in the LOCATION URI: specify *HA\_nameservice* in place of
*name\_node\_host:51200*.
+Display the contents of a text file in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -cat /data/exampledir/example.txt
+```
+
+
+## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
+The PXF HDFS plug-in supports the `HdfsTextSimple`, `HdfsTextMulti`, and
`Avro` profiles.
+
+Use the following syntax to create a HAWQ external table representing HDFS
data:
``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
-LOCATION ('pxf://<HA nameservice>/<path to file or
directory>?Profile=profile[&<additional options>=<value>]')
-FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
+CREATE EXTERNAL TABLE <table_name>
+ ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
--- End diff ---
Why the brackets around `:<port>`?
> PXF HDFS documentation - restructure content and include more examples
> ----------------------------------------------------------------------
>
> Key: HAWQ-1107
> URL: https://issues.apache.org/jira/browse/HAWQ-1107
> Project: Apache HAWQ
> Issue Type: Improvement
> Components: Documentation
> Reporter: Lisa Owen
> Assignee: David Yozie
> Priority: Minor
> Fix For: 2.0.1.0-incubating
>
>
> The current PXF HDFS documentation does not include any runnable examples.
> Add runnable examples for all (HdfsTextSimple, HdfsTextMulti, SerialWritable,
> Avro) profiles. Restructure the content as well.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)