Github user dyozie commented on a diff in the pull request:
https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85003579
--- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
@@ -415,93 +312,101 @@ The following example uses the Avro schema shown in [Sample Avro Schema](#topic_
{"name":"street", "type":"string"},
{"name":"city", "type":"string"}]
}
- }, {
- "name": "relationship",
- "type": {
- "type": "enum",
- "name": "relationshipEnum",
- "symbols":
["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
- }
- }, {
- "name" : "md5",
- "type": {
- "type" : "fixed",
- "name" : "md5Fixed",
- "size" : 4
- }
} ],
"doc:" : "A basic schema for storing messages"
}
```
-#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+
+Create a text file named `pxf_hdfs_avro.txt`:
+
+``` shell
+$ vi /tmp/pxf_hdfs_avro.txt
+```
+
+Enter the following data into `pxf_hdfs_avro.txt`:
``` pre
-{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null,
"relationship": "FRIEND", "fmap": {"kate":10,"santosh":4},
-"address":{"street":"renaissance drive", "number":1,"city":"san jose"},
"md5":\u3F00\u007A\u0073\u0074}
+{"id":1, "username":"john","followers":["kate", "santosh"],
"relationship": "FRIEND", "fmap": {"kate":10,"santosh":4},
"address":{"number":1, "street":"renaissance drive", "city":"san jose"}}
+
+{"id":2, "username":"jim","followers":["john", "pam"], "relationship":
"COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer
creek", "city":"palo alto"}}
+```
+
+The sample data uses a comma `,` to separate top-level records, and a colon `:` to separate map key/value pairs and record field name/value pairs.
-{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3},
-"address":{"street":"deer creek", "number":9,"city":"palo alto"}, "md5":\u0010\u0021\u0003\u0004}
+Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line. In this example, we use the [Java Avro tools](http://avro.apache.org/releases.html), and the jar file resides in the current directory:
+
+``` shell
+$ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
```
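
The conversion can also be scripted against the Avro Java API directly (the same library the avro-tools jar wraps). The following is a minimal sketch, not part of the PXF documentation, assuming Avro 1.8 is on the classpath; the `JsonToAvro` class name is illustrative:

``` java
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

// Converts schema-conformant JSON records to an Avro binary container file,
// mirroring what `avro-tools fromjson` does.
public class JsonToAvro {
    public static void main(String[] args) throws Exception {
        // Parse the Avro schema used to interpret the JSON records
        Schema schema = new Schema.Parser().parse(new File("/tmp/avro_schema.avsc"));
        try (InputStream in = new FileInputStream("/tmp/pxf_hdfs_avro.txt");
             DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // The JSON decoder reads one schema-conformant record per call
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, in);
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            writer.create(schema, new File("/tmp/pxf_hdfs_avro.avro"));
            while (true) {
                try {
                    writer.append(reader.read(null, decoder));
                } catch (EOFException eof) {
                    break;  // no more JSON records in the input
                }
            }
        }
    }
}
```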
-To map this Avro file to an external table, the top-level primitive fields ("id" of type long and "username" of type string) are mapped to their equivalent HAWQ types (bigint and text). The remaining complex fields are mapped to text columns:
+The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. Copy this file to HDFS:
-``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex
- (id bigint,
- username text,
- followers text,
- rank int,
- fmap text,
- address text,
- relationship text,
- md5 bytea)
-LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
```
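
Optionally confirm that the copy succeeded by listing the target (this assumes the `/data/pxf_examples` directory already exists in HDFS):

``` shell
$ sudo -u hdfs hdfs dfs -ls /data/pxf_examples/pxf_hdfs_avro.avro
```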
+### <a id="topic_avro_querydata"></a>Querying Avro Data
+
+Create a queryable external table from this Avro file:
-The above command uses default delimiters for separating components of the complex types. This command is equivalent to the one above, but it explicitly sets the delimiters using the Avro profile parameters:
+- Map the top-level primitive fields, `id` (type long) and `username` (type string), to their equivalent HAWQ types (bigint and text).
+- Map the remaining complex fields to type text.
+- Explicitly set the record, map, and collection delimiters using the Avro profile custom options:
``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex
- (id bigint,
- username text,
- followers text,
- rank int,
- fmap text,
- address text,
- relationship text,
- md5 bytea)
-LOCATION ('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
+            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
+            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
A simple query of the external table shows the components of the complex type data separated with delimiters:
``` sql
-gpadmin=# select * from avro_complex;
+gpadmin=# SELECT * FROM pxf_hdfs_avro;
```
``` pre
-id | username | followers | rank | fmap | address | relationship | md5
-1| john | [kate,santosh] | | {kate:10,santosh:4} | {number:1,street:renaissance drive,city:san jose} | FRIEND | ?zst
-2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer creek,city:palo alto} | COLLEAGUE | \020!\003\004
+ id | username |   followers    |        fmap         | relationship |                      address
+----+----------+----------------+---------------------+--------------+---------------------------------------------------
+  1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
+  2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
+(2 rows)
```
-You can process the delimited components in the text columns as necessary for your application. For example, the following command uses the `string_to_array` function to convert entries in the "followers" field to a text array column in a new view. The view is then queried to filter rows based on whether a particular follower appears in the array:
+Process the delimited components in the text columns as necessary for your application. For example, the following command uses the HAWQ internal `string_to_array` function to convert entries in the `followers` field to a text array column in a new view.
``` sql
-gpadmin=# create view followers_view as
-  select username, address, string_to_array(substring(followers from 2 for (char_length(followers) - 2)), ',')::text[]
- as followers
- from avro_complex;
-
-gpadmin=# select username, address
-from followers_view
-where john = ANY(followers);
+gpadmin=# CREATE VIEW followers_view AS
+    SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
+ AS followers
+ FROM pxf_hdfs_avro;
+```
+
+Query the view to filter rows based on whether a particular follower appears in the array:
+
+```
+gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
```
``` pre
-username | address
-jim | {number:9,street:deer creek,city:palo alto}
+ username | address
+----------+---------------------------------------------
+ jim | {number:9,street:deer creek,city:palo alto}
+```
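
The `@>` operator tests array containment. The `= ANY()` construct used in the earlier revision of this example is an equivalent filter, provided the follower name is quoted as a string literal rather than left bare:

``` sql
gpadmin=# SELECT username, address FROM followers_view WHERE 'john' = ANY(followers);
```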
+
+## <a id="accessdataonahavhdfscluster"></a>Accessing HDFS Data in a High Availability HDFS Cluster
+
+To access external HDFS data in a High Availability HDFS cluster, change the `CREATE EXTERNAL TABLE` `LOCATION` clause to use \<HA-nameservice\> rather than \<host\>[:\<port\>].
+
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+            LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
+            FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```
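
For example, assuming a hypothetical HA nameservice named `mycluster`, the Avro table created earlier would be defined as follows (the table name `pxf_hdfs_avro_ha` is likewise illustrative):

``` sql
gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro_ha(id bigint, username text, followers text, fmap text, relationship text, address text)
            LOCATION ('pxf://mycluster/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```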
+
+Conversely, when a highly available HDFS cluster is reverted to a single NameNode configuration, any table definition that specified \<HA-nameservice\> must be updated to use the \<host\>[:\<port\>] syntax.
+
+
+## <a id="hdfs_advanced"></a>Advanced
--- End diff ---
This should probably be mentioned earlier, when we list the built-in
profiles.