Github user dyozie commented on a diff in the pull request:
https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85003579
--- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
@@ -415,93 +312,101 @@ The following example uses the Avro schema shown in [Sample Avro Schema](#topic_
{"name":"street", "type":"string"},
{"name":"city", "type":"string"}]
}
- }, {
- "name": "relationship",
- "type": {
- "type": "enum",
- "name": "relationshipEnum",
- "symbols":
["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
- }
- }, {
- "name" : "md5",
- "type": {
- "type" : "fixed",
- "name" : "md5Fixed",
- "size" : 4
- }
} ],
"doc:" : "A basic schema for storing messages"
}
```
-#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+
+Create a text file named `pxf_hdfs_avro.txt`:
+
+``` shell
+$ vi /tmp/pxf_hdfs_avro.txt
+```
+
+Enter the following data into `pxf_hdfs_avro.txt`:
``` pre
-{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null,
"relationship": "FRIEND", "fmap": {"kate":10,"santosh":4},
-"address":{"street":"renaissance drive", "number":1,"city":"san jose"},
"md5":\u3F00\u007A\u0073\u0074}
+{"id":1, "username":"john","followers":["kate", "santosh"],
"relationship": "FRIEND", "fmap": {"kate":10,"santosh":4},
"address":{"number":1, "street":"renaissance drive", "city":"san jose"}}
+
+{"id":2, "username":"jim","followers":["john", "pam"], "relationship":
"COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer
creek", "city":"palo alto"}}
+```
+
+The sample data uses a comma `,` to separate top-level records, and a colon `:` to separate map key/value pairs and record field name/value pairs.
-{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3},
-"address":{"street":"deer creek", "number":9,"city":"palo alto"}, "md5":\u0010\u0021\u0003\u0004}
+Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line. In this example, we use the [Java Avro tools](http://avro.apache.org/releases.html), and the jar file resides in the current directory:
+
+``` shell
+$ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
```
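
The conversion can also be scripted against the Avro Java API directly (the same library the avro-tools jar wraps). The following is a minimal sketch, not part of the PXF documentation, assuming Avro 1.8 is on the classpath; the `JsonToAvro` class name is illustrative:

``` java
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

// Converts schema-conformant JSON records to an Avro binary container file,
// mirroring what `avro-tools fromjson` does.
public class JsonToAvro {
    public static void main(String[] args) throws Exception {
        // Parse the Avro schema used to interpret the JSON records
        Schema schema = new Schema.Parser().parse(new File("/tmp/avro_schema.avsc"));
        try (InputStream in = new FileInputStream("/tmp/pxf_hdfs_avro.txt");
             DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // The JSON decoder reads one schema-conformant record per call
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, in);
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            writer.create(schema, new File("/tmp/pxf_hdfs_avro.avro"));
            while (true) {
                try {
                    writer.append(reader.read(null, decoder));
                } catch (EOFException eof) {
                    break;  // no more JSON records in the input
                }
            }
        }
    }
}
```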
-To map this Avro file to an external table, the top-level primitive fields ("id" of type long and "username" of type string) are mapped to their equivalent HAWQ types (bigint and text). The remaining complex fields are mapped to text columns:
+The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. Copy this file to HDFS:
-``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex
- (id bigint,
- username text,
- followers text,
- rank int,
- fmap text,
- address text,
- relationship text,
- md5 bytea)
-LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
```
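
Optionally confirm that the copy succeeded by listing the target (this assumes the `/data/pxf_examples` directory already exists in HDFS):

``` shell
$ sudo -u hdfs hdfs dfs -ls /data/pxf_examples/pxf_hdfs_avro.avro
```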
+### <a id="topic_avro_querydata"></a>Querying Avro Data
+
+Create a queryable external table from this Avro file:
-The above command uses default delimiters for separating components of the complex types. This command is equivalent to the one above, but it explicitly sets the delimiters using the Avro profile parameters:
+- Map the top-level primitive fields, `id` (type long) and `username` (type string), to their equivalent HAWQ types (bigint and text).
+- Map the remaining complex fields to type text.
+- Explicitly set the record, map, and collection delimiters using the Avro profile custom options:
``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex
- (id bigint,
- username text,
- followers text,
- rank int,
- fmap text,
- address text,
- relationship text,
- md5 bytea)
-LOCATION ('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
+            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
+            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
A simple query of the external table shows the components of the complex type data separated with delimiters:
``` sql
-gpadmin=# select * from avro_complex;
+gpadmin=# SELECT * FROM pxf_hdfs_avro;
```
``` pre
-id | username | followers | rank | fmap | address | relationship | md5
-1| john | [kate,santosh] | | {kate:10,santosh:4} | {number:1,street:renaissance drive,city:san jose} | FRIEND | ?zst
-2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer creek,city:palo alto} | COLLEAGUE | \020!\003\004
+ id | username |   followers    |        fmap         | relationship |                      address
+----+----------+----------------+---------------------+--------------+---------------------------------------------------
+  1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
+  2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
+(2 rows)
```
-You can process the delimited components in the text columns as necessary for your application. For example, the following command uses the `string_to_array` function to convert entries in the "followers" field to a text array column in a new view. The view is then queried to filter rows based on whether a particular follower appears in the array:
+Process the delimited components in the text columns as necessary for your application. For example, the following command uses the HAWQ internal `string_to_array` function to convert entries in the `followers` field to a text array column in a new view.
``` sql
-gpadmin=# create view followers_view as
-  select username, address, string_to_array(substring(followers from 2 for (char_length(followers) - 2)), ',')::text[]
- as followers
- from avro_complex;
-
-gpadmin=# select username, address
-from followers_view
-where john = ANY(followers);
+gpadmin=# CREATE VIEW followers_view AS
+    SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
+ AS followers
+ FROM pxf_hdfs_avro;
+```
+
+Query the view to filter rows based on whether a particular follower appears in the array:
+
+```
+gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
```
``` pre
-username | address
-jim | {number:9,street:deer creek,city:palo alto}
+ username | address
+----------+---------------------------------------------
+ jim | {number:9,street:deer creek,city:palo alto}
+```
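
The `@>` operator tests array containment. The `= ANY()` construct used in the earlier revision of this example is an equivalent filter, provided the follower name is quoted as a string literal rather than left bare:

``` sql
gpadmin=# SELECT username, address FROM followers_view WHERE 'john' = ANY(followers);
```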
+
+## <a id="accessdataonahavhdfscluster"></a>Accessing HDFS Data in a High Availability HDFS Cluster
+
+To access external HDFS data in a High Availability HDFS cluster, change the `CREATE EXTERNAL TABLE` `LOCATION` clause to use \<HA-nameservice\> rather than \<host\>[:\<port\>].
+
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+            LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
+            FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```
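
For example, assuming a hypothetical HA nameservice named `mycluster`, the Avro table created earlier would be defined as follows (the table name `pxf_hdfs_avro_ha` is likewise illustrative):

``` sql
gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro_ha(id bigint, username text, followers text, fmap text, relationship text, address text)
            LOCATION ('pxf://mycluster/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```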
+
+Conversely, when a highly available HDFS cluster is reverted to a single NameNode configuration, any table definition that specified \<HA-nameservice\> must be updated to use the \<host\>[:\<port\>] syntax.
+
+
+## <a id="hdfs_advanced"></a>Advanced
--- End diff ---
This should probably be mentioned earlier, when we list the built-in
profiles.