[
https://issues.apache.org/jira/browse/HAWQ-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612279#comment-15612279
]
ASF GitHub Bot commented on HAWQ-1107:
--------------------------------------
Github user kavinderd commented on a diff in the pull request:
https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85358483
--- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
@@ -2,506 +2,449 @@
title: Accessing HDFS File Data
---
-## <a id="installingthepxfhdfsplugin"></a>Prerequisites
+HDFS is the primary distributed storage mechanism used by Apache Hadoop
applications. The PXF HDFS plug-in reads file data stored in HDFS. The plug-in
supports plain delimited and comma-separated-value format text files. The HDFS
plug-in also supports the Avro binary format.
-Before working with HDFS file data using HAWQ and PXF, you should perform
the following operations:
+This section describes how to use PXF to access HDFS data, including how
to create and query an external table from files in the HDFS data store.
-- Test PXF on HDFS before connecting to Hive or HBase.
-- Ensure that all HDFS users have read permissions to HDFS services and
that write permissions have been limited to specific users.
+## <a id="hdfsplugin_prereq"></a>Prerequisites
-## <a id="syntax1"></a>Syntax
+Before working with HDFS file data using HAWQ and PXF, ensure that:
-The syntax for creating an external HDFS file is as follows:
+- The HDFS plug-in is installed on all cluster nodes. See [Installing
PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
+- All HDFS users have read permissions to HDFS services, and write
permissions are restricted to specific users.
-``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
- ( column_name data_type [, ...] | LIKE other_table )
-LOCATION ('pxf://host[:port]/path-to-data?<pxf
parameters>[&custom-option=value...]')
- FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
-```
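As a concrete sketch of this syntax (the host name, port, file path, and
columns here are hypothetical), a readable external table over a delimited
text file might look like:

``` sql
-- Hypothetical readable table over a delimited text file in HDFS;
-- namenode:51200 and the file path are placeholder values.
CREATE EXTERNAL TABLE sample_hdfs_text (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/sample.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter = E',');
```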
+## <a id="hdfsplugin_fileformats"></a>HDFS File Formats
-where `<pxf parameters>` is:
+The PXF HDFS plug-in supports reading the following file formats:
-``` pre
-[FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
- | PROFILE=profile-name
-```
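For illustration (host, port, and path are placeholders), the class-name
form of these parameters, equivalent to the `HdfsTextSimple` profile, might
read:

``` sql
-- Hypothetical table spelling out the plug-in classes rather than a
-- profile; class names are taken from the Accessor and Resolver tables
-- in this topic.
CREATE EXTERNAL TABLE sample_hdfs_classes (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/sample.txt?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.StringPassResolver')
FORMAT 'TEXT' (delimiter = E',');
```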
+- Text File - comma-separated value (.csv) or delimited format plain text
file
+- Avro - JSON-defined, schema-based data serialization format
-**Note:** Omit the `FRAGMENTER` parameter for `WRITABLE` external tables.
+The PXF HDFS plug-in includes the following profiles to support the file
formats listed above:
-Use an SQL `SELECT` statement to read from an HDFS READABLE table:
+- `HdfsTextSimple` - text files
+- `HdfsTextMulti` - text files with embedded line feeds
+- `Avro` - Avro files
-``` sql
-SELECT ... FROM table_name;
+If you find that the pre-defined PXF HDFS profiles do not meet your needs,
you may choose to create a custom HDFS profile from the existing HDFS
serialization and deserialization classes. Refer to [Adding and Updating
Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on
creating a custom profile.
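As an example of choosing among these profiles (the file path and columns
are assumed for illustration), a text file containing quoted, embedded line
feeds could be read with `HdfsTextMulti`:

``` sql
-- Hypothetical table over multi-line text data; namenode:51200 and the
-- path are placeholders.
CREATE EXTERNAL TABLE sample_multiline (address text, comment text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/multiline.csv?PROFILE=HdfsTextMulti')
FORMAT 'CSV';
```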
+
+## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
+Hadoop includes command-line tools that interact directly with HDFS.
These tools support typical file system operations including copying and
listing files, changing file permissions, and so forth.
+
+The HDFS file system command syntax is `hdfs dfs <options> [<file>]`.
Invoked with no options, `hdfs dfs` lists the file system options supported by
the tool.
+
+`hdfs dfs` options used in this topic are:
+
+| Option | Description |
+|-------|-------------------------------------|
+| `-cat` | Display file contents. |
+| `-mkdir` | Create directory in HDFS. |
+| `-put` | Copy file from local file system to HDFS. |
+
+Examples:
+
+Create a directory in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -mkdir -p /data/exampledir
```
-Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
+Copy a text file to HDFS:
-``` sql
-INSERT INTO table_name ...;
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/example.txt /data/exampledir/
```
-To read data from or write data to HDFS files, specify the appropriate
`FORMAT` clause and either a `PROFILE` or explicit Fragmenter, Accessor, and
Resolver classes.
-
-This topic describes the following:
-
-- FORMAT clause
-- Profile
-- Accessor
-- Resolver
-- Avro
-
-**Note:** For more details about the API and classes, see [PXF External
Tables and
API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
-
-### <a id="formatclause"></a>FORMAT clause
-
-Use one of the following formats to read data with any PXF connector:
-
-- `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
-- `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
-- `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and
binary formats. Must always be used with the built-in formatter
'`pxfwritable_import`' (for read) or '`pxfwritable_export`' (for write).
-
-**Note:** When creating PXF external tables, you cannot use the `HEADER`
option in your `FORMAT` specification.
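Tying these formats together (host, port, path, and columns are assumptions
for illustration), an Avro file read with `FORMAT 'CUSTOM'` might be defined
as:

``` sql
-- Hypothetical Avro read table; FORMAT 'CUSTOM' requires the built-in
-- pxfwritable_import formatter for reads.
CREATE EXTERNAL TABLE sample_avro (id int, username text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/sample.avro?PROFILE=Avro')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```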
-
-### <a id="topic_ab2_sxy_bv"></a>Profile
-
-For plain or comma-separated text files in HDFS, use either the
`HdfsTextSimple` or `HdfsTextMulti` profile, or the class name
org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro` profile
for Avro files. See [Using Profiles to Read and Write
Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information.
-
-**Note:** For read tables, you must include a Profile or a Fragmenter in
the table definition.
-
-### <a id="accessor"></a>Accessor
-
-The choice of an Accessor depends on the HDFS data file type.
-
-**Note:** You must include either a Profile or an Accessor in the table
definition.
-
-<table>
-<colgroup>
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>File Type</th>
-<th>Accessor</th>
-<th>FORMAT clause</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Plain Text delimited</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</td>
-<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
-<td> Read + Write
-<p>You cannot use the <code class="ph codeph">HEADER</code>
option.</p></td>
-</tr>
-<tr class="even">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>)</td>
-<td><p>LineBreakAccessor is parallel and faster than
QuotedLineBreakAccessor.</p>
-<p>Use if each logical data row is a physical data line.</p>
-<p>Read + Write </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code>
option.</p></td>
-</tr>
-<tr class="odd">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
-<p>Use if the data includes embedded (quoted) linefeed characters.</p>
-<p>Read Only </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code>
option.</p></td>
-</tr>
-<tr class="even">
-<td>SequenceFile</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read + Write (use formatter='pxfwritable_export' for write)</td>
-</tr>
-<tr class="odd">
-<td>AvroFile</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read Only</td>
-</tr>
-</tbody>
-</table>
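As a sketch based on the SequenceFile row above (the table name, path, and
schema class are hypothetical), a writable table might combine
SequenceFileAccessor with WritableResolver:

``` sql
-- Hypothetical writable SequenceFile table; per the note above, the
-- FRAGMENTER parameter is omitted for WRITABLE tables. MyWritableClass
-- is a placeholder for a Java Writable schema class on the PXF class path.
CREATE WRITABLE EXTERNAL TABLE sample_seq_write (id int, total float8)
LOCATION ('pxf://namenode:51200/data/pxf_examples/seqdata?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver&SCHEMA-DATA=MyWritableClass')
FORMAT 'CUSTOM' (formatter='pxfwritable_export');
```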
-
-### <a id="resolver"></a>Resolver
-
-Choose a Resolver based on how data records are serialized in the HDFS
file.
-
-**Note:** You must include a Profile or a Resolver in the table definition.
-
-<table>
-<colgroup>
-<col width="33%" />
-<col width="33%" />
-<col width="33%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Record Serialization</th>
-<th>Resolver</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Avro</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.AvroResolver</td>
-<td><ul>
-<li>Avro files include the record schema; Avro serialization can also be
used in other file types (e.g., SequenceFile).</li>
-<li>For Avro-serialized records outside of an Avro file, include a schema
file name (.avsc) in the URL under the optional <code class="ph
codeph">Schema-Data</code> option.</li>
-<li>Deserialize Only (Read).</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>Java Writable</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.WritableResolver</td>
-<td><ul>
-<li>Include the name of the Java class that uses Writable serialization in
the URL under the optional <code class="ph codeph">Schema-Data</code>
option.</li>
-<li>The class file must exist in the public stage directory (or in
Hadoop's class path).</li>
-<li>Deserialize and Serialize (Read + Write). </li>
-<li>See <a href="#customizedwritableschemafileguidelines">Customized
Writable Schema File Guidelines</a>.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>None (plain text)</td>
-<td>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</td>
-<td><ul>
-<li>Does not serialize plain-text records; PXF passes records as-is and
the database parses them.</li>
-<li>Deserialize and Serialize (Read + Write).</li>
-</ul></td>
-</tr>
-</tbody>
-</table>
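For illustration of the Avro row above (all paths are placeholders),
reading Avro-serialized records from a sequence file with an external .avsc
schema might look like:

``` sql
-- Hypothetical readable table for Avro records stored outside an Avro
-- container file; the .avsc schema file name is passed via Schema-Data.
CREATE EXTERNAL TABLE sample_avro_seq (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/avroseq?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.AvroResolver&SCHEMA-DATA=sample.avsc')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```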
-
-#### <a id="customizedwritableschemafileguidelines"></a>Schema File
Guidelines for WritableResolver
-
-When using a WritableResolver, you must define a schema file. The file
must be a Java class file and must be on the PXF class path.
-
-The class must meet the following requirements:
-
-1. Must implement the org.apache.hadoop.io.Writable interface.
-2. WritableResolver uses reflection to recreate the schema and populate
its fields (for both read and write), then uses the Writable interface
functions to read and write. Fields must therefore be public; private fields
are ignored.
-3. Fields are accessed and populated in the order in which they are
declared in the class file.
-4. Supported field types:
- - boolean
- - byte array
- - double
- - float
- - int
- - long
- - short
- - string
-
- Arrays of any of the above types are supported, but the constructor
must define the array size so that reflection works.
-
-### <a id="additionaloptions"></a>Additional Options
-
-<a id="additionaloptions__table_skq_kpz_4p"></a>
-
-<table>
-<caption><span class="tablecap">Table 1. Additional PXF
Options</span></caption>
-<colgroup>
-<col width="50%" />
-<col width="50%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Option Name</th>
-<th>Description</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>COLLECTION_DELIM</td>
-<td>(Avro or Hive profiles only.) The delimiter character(s) to place
between entries in a top-level array, map, or record field when PXF maps a Hive
or Avro complex data type to a text column. The default is a ","
character.</td>
-</tr>
-<tr class="even">
-<td>COMPRESSION_CODEC</td>
-<td><ul>
-<li>Useful for WRITABLE PXF tables.</li>
-<li>Specifies the compression codec class name for compressing the written
data. The class must implement the
org.apache.hadoop.io.compress.CompressionCodec interface.</li>
-<li>Some valid values are org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec, and
org.apache.hadoop.io.compress.BZip2Codec.</li>
-<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread
and can be slow.</li>
-<li>This option has no default value. </li>
-<li>When the option is not defined, no compression will be done.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>COMPRESSION_TYPE</td>
-<td><ul>
-<li>Useful for WRITABLE PXF tables that use SequenceFileAccessor.</li>
-<li>Ignored when COMPRESSION_CODEC is not defined.</li>
-<li>Specifies the compression type for sequence file.</li>
-<li>Valid options are:
-<ul>
-<li>RECORD - only the value part of each row is compressed.</li>
-<li>BLOCK - both keys and values are collected in 'blocks' separately and
compressed.</li>
-</ul></li>
-<li>Default value: RECORD.</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>MAPKEY_DELIM</td>
-<td>(Avro or Hive profiles only.) The delimiter character(s) to place
between the key and value of a map entry when PXF maps a Hive or Avro complex
data type to a text column. The default is a ":" character.</td>
-</tr>
-<tr class="odd">
-<td>RECORDKEY_DELIM</td>
-<td>(Avro profile only.) The delimiter character(s) to place between the
field name and value of a record entry when PXF maps an Avro complex data type
to a text column. The default is a ":" character.</td>
-</tr>
-<tr class="even">
-<td>SCHEMA-DATA</td>
-<td>The data schema file used to create and read the HDFS file. For
example, an .avsc file (for Avro) or a Java class (for Writable
serialization). Make sure that you have added any JAR files containing the
schema to <code class="ph codeph">pxf-public.classpath</code>.
-<p>This option has no default value.</p></td>
-</tr>
-<tr class="odd">
-<td>THREAD-SAFE</td>
-<td>Determines whether the table query can run in multi-threaded mode.
When set to FALSE, requests are handled in a single thread.
-<p>Set to FALSE when the query uses a plug-in or other element that is not
thread-safe (e.g., a compression codec).</p>
-<p>Allowed values: TRUE, FALSE. The default value is TRUE; requests can
run in multi-threaded mode.</p></td>
-</tr>
-<tr class="even">
-<td> <custom></td>
-<td>Any option added to the pxf URI string will be accepted and passed,
along with its value, to the Fragmenter, Accessor, and Resolver
implementations.</td>
-</tr>
-</tbody>
-</table>
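As one example of these options in use (host, port, and path are assumed),
a writable text table could enable Gzip compression:

``` sql
-- Hypothetical writable table compressing its output with GzipCodec;
-- COMPRESSION_CODEC names a class implementing CompressionCodec.
CREATE WRITABLE EXTERNAL TABLE sample_compressed (id int, name text)
LOCATION ('pxf://namenode:51200/data/pxf_examples/compressed?PROFILE=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (delimiter = E',');
```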
-
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data
on a High Availability HDFS Cluster
-
-To access data on a High Availability HDFS cluster, change the authority
in the LOCATION URI: specify *HA\_nameservice* in place of
*name\_node\_host:51200*.
+Display the contents of a text file in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -cat /data/exampledir/example.txt
+```
+
+
+## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
+The PXF HDFS plug-in supports the `HdfsTextSimple`, `HdfsTextMulti`, and
`Avro` profiles.
+
+Use the following syntax to create a HAWQ external table representing HDFS
data:
``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
-LOCATION ('pxf://<HA nameservice>/<path to file or
directory>?Profile=profile[&<additional options>=<value>]')
-FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
+CREATE EXTERNAL TABLE <table_name>
+ ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
--- End diff ---
Why the brackets around `:<port>`?
> PXF HDFS documentation - restructure content and include more examples
> ----------------------------------------------------------------------
>
> Key: HAWQ-1107
> URL: https://issues.apache.org/jira/browse/HAWQ-1107
> Project: Apache HAWQ
> Issue Type: Improvement
> Components: Documentation
> Reporter: Lisa Owen
> Assignee: David Yozie
> Priority: Minor
> Fix For: 2.0.1.0-incubating
>
>
> The current PXF HDFS documentation does not include any runnable examples.
> Add runnable examples for all (HdfsTextSimple, HdfsTextMulti, SerialWritable,
> Avro) profiles. Restructure the content as well.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)