[ https://issues.apache.org/jira/browse/HAWQ-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866238#comment-15866238 ]
Kyle R Dunn edited comment on HAWQ-1234 at 2/14/17 5:55 PM:
------------------------------------------------------------

I did some initial exploration of the HAWQ -> PXF communication chain for a different purpose, and I'm pasting in what I've learned so far. Note that PXF itself does not store metadata: either HAWQ provides it directly or HCatalog can be queried for it; I'm showing the latter.

PXF expects the metadata about the data, as well as some other pieces, to be provided as HTTP headers, which it appears to convert to a hashmap on the server side, as shown [here|https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/rest/RestResource.java#L52].

Get the PXF server version:
{code}
$ curl 'http://localhost:51200/pxf/ProtocolVersion'
{ "version": "v14"}
{code}

Get metadata from HCatalog for a Hive table called "kdtest" in the "default" database:
{code}
$ curl -i \
    -H "X-GP-SEGMENT-ID: -100005432" \
    -H "X-GP-SEGMENT-COUNT: 0" \
    -H "X-GP-XID: 2724107" \
    -H "X-GP-ALIGNMENT: 8" \
    -H "X-GP-URL-HOST: localhost" \
    -H "X-GP-URL-PORT: 51200" \
    -H "X-GP-URI: localhost:51200/" \
    -H "X-GP-HAS-FILTER: 0" \
    'localhost:51200/pxf/v14/Metadata/getMetadata?profile=Hive&pattern=default.kdtest'
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json
Content-Length: 132
Date: Tue, 14 Feb 2017 05:06:11 GMT

{"PXFMetadata":[{"item":{"path":"default","name":"kdtest"},"fields":[{"name":"key","type":"text"},{"name":"value","type":"text"}]}]}
{code}

Get the PXF "Fragments" for the above table's data (requested in {{TEXT}} format; {{GPDBWritable}} is also valid):
{code}
$ curl -i \
    -H "X-GP-SEGMENT-ID: -100005432" \
    -H "X-GP-SEGMENT-COUNT: 0" \
    -H "X-GP-XID: 2724107" \
    -H "X-GP-ALIGNMENT: 8" \
    -H "X-GP-URL-HOST: localhost" \
    -H "X-GP-URL-PORT: 51200" \
    -H "X-GP-URI: pxf://localhost:51200/default.kdtest?Profile=Hive" \
    -H "X-GP-HAS-FILTER: 0" \
    -H "X-GP-FORMAT: TEXT" \
    -H "X-GP-ATTRS: 2" \
    -H "X-GP-ATTR-NAME0: key" \
    -H "X-GP-ATTR-TYPECODE0: 25" \
    -H "X-GP-ATTR-TYPENAME0: text" \
    -H "X-GP-ATTR-NAME1: value" \
    -H "X-GP-ATTR-TYPECODE1: 25" \
    -H "X-GP-ATTR-TYPENAME1: text" \
    -H "X-GP-Profile: Hive" \
    -H "X-GP-DATA-DIR: default.kdtest" \
    'http://localhost:51200/pxf/v14/Fragmenter/getFragments?path=/apps/hive/warehouse/kdtest'
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json
Content-Length: 1305
Date: Tue, 14 Feb 2017 05:30:05 GMT

{"PXFFragments":[{"sourceName":"/apps/hive/warehouse/kdtest/hive-test-data.txt","index":0,"replicas":["10.215.181.12","10.215.181.11"],"metadata":"rO0ABXcQAAAAAAAAAAAAAAAAAAAAN3VyABNbTGphdmEubGFuZy5TdHJpbmc7rdJW5+kde0cCAAB4cAAAAAJ0AB1jbHBxbjFwZGhkYmRuMDIuaW5mb3NvbGNvLm5ldHQAHWNscHFuMXBkaGRiZG4wMS5pbmZvc29sY28ubmV0","userData":"b3JnLmFwYWNoZS5oYWRvb3AubWFwcmVkLlRleHRJbnB1dEZvcm1hdCFIVUREIW9yZy5hcGFjaGUuaGFkb29wLmhpdmUuc2VyZGUyLmxhenkuTGF6eVNpbXBsZVNlckRlIUhVREQhIwojTW9uIEZlYiAxMyAyMToyOTozNSBQU1QgMjAxNwpuYW1lPWRlZmF1bHQua2R0ZXN0Cm51bUZpbGVzPTEKZmllbGQuZGVsaW09LApjb2x1bW5zLnR5cGVzPXN0cmluZ1w6c3RyaW5nCnNlcmlhbGl6YXRpb24uZGRsPXN0cnVjdCBrZHRlc3QgeyBzdHJpbmcga2V5LCBzdHJpbmcgdmFsdWV9CmNvbHVtbnM9a2V5LHZhbHVlCnNlcmlhbGl6YXRpb24uZm9ybWF0PSwKY29sdW1ucy5jb21tZW50cz1cdTAwMDAKYnVja2V0X2NvdW50PS0xCnNlcmlhbGl6YXRpb24ubGliPW9yZy5hcGFjaGUuaGFkb29wLmhpdmUuc2VyZGUyLmxhenkuTGF6eVNpbXBsZVNlckRlCkNPTFVNTl9TVEFUU19BQ0NVUkFURT10cnVlCmZpbGUuaW5wdXRmb3JtYXQ9b3JnLmFwYWNoZS5oYWRvb3AubWFwcmVkLlRleHRJbnB1dEZvcm1hdAp0b3RhbFNpemU9NTUKZmlsZS5vdXRwdXRmb3JtYXQ9b3JnLmFwYWNoZS5oYWRvb3AuaGl2ZS5xbC5pby5IaXZlSWdub3JlS2V5VGV4dE91dHB1dEZvcm1hdApsb2NhdGlvbj1oZGZzXDovL2NscHFuMXBkaGRibW4wMS5pbmZvc29sY28ubmV0XDo4MDIwL2FwcHMvaGl2ZS93YXJlaG91c2Uva2R0ZXN0CnRyYW5zaWVudF9sYXN0RGRsVGltZT0xNDg3MDA2NDg4CiFIVUREISFITlBUISFIVUREIWZhbHNl"}]}
{code}

The Hive table looks like this:
{code}
hive> describe formatted kdtest;
OK
# col_name              data_type               comment

key                     string
value                   string

# Detailed Table Information
Database:               default
Owner:                  kdunn
CreateTime:             Mon Feb 13 09:20:40 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://nowhere.com:8020/apps/hive/warehouse/kdtest
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        totalSize               55
        transient_lastDdlTime   1487006488

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             ,
        serialization.format    ,
Time taken: 0.373 seconds, Fetched: 31 row(s)
{code}

The data in it is this:
{code}
hive> select * from kdtest;
OK
somekey somevalue
1234    56789
hello   world
aloha   mondays
Time taken: 0.043 seconds, Fetched: 4 row(s)
{code}

The raw data was this:
{code}
$ cat /tmp/hive-test-data.txt
somekey,somevalue
1234,56789
hello,world
aloha,mondays
{code}

Hive DDL and DML:
{code}
hive> CREATE TABLE kdtest (key string, value string) row format delimited fields terminated by ',';
hive> LOAD DATA local inpath '/tmp/hive-test-data.txt' into table kdtest;
{code}
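The header pattern in the getFragments curl call above can be sketched in Python to make the per-column {{X-GP-ATTR-*}} numbering explicit. This is only a sketch under assumptions: the function names {{build_pxf_headers}} and {{get_fragments}} are hypothetical, the fixed header values are copied verbatim from the curl example without knowing which of them PXF actually requires, and the request itself needs a PXF service listening on localhost:51200.

```python
import json
import urllib.parse
import urllib.request

# Host, port, and protocol version are taken from the curl examples above.
PXF_BASE = "http://localhost:51200/pxf/v14"


def build_pxf_headers(columns, data_dir, profile="Hive", fmt="TEXT"):
    """Return X-GP-* headers for columns given as (name, typecode, typename)."""
    headers = {
        # Values copied verbatim from the curl example; which ones are
        # mandatory is an open question, not something this sketch settles.
        "X-GP-SEGMENT-ID": "-100005432",
        "X-GP-SEGMENT-COUNT": "0",
        "X-GP-XID": "2724107",
        "X-GP-ALIGNMENT": "8",
        "X-GP-URL-HOST": "localhost",
        "X-GP-URL-PORT": "51200",
        "X-GP-URI": f"pxf://localhost:51200/{data_dir}?Profile={profile}",
        "X-GP-HAS-FILTER": "0",
        "X-GP-FORMAT": fmt,
        "X-GP-Profile": profile,
        "X-GP-DATA-DIR": data_dir,
        "X-GP-ATTRS": str(len(columns)),
    }
    # One NAME/TYPECODE/TYPENAME triple per column, numbered from 0.
    for i, (name, typecode, typename) in enumerate(columns):
        headers[f"X-GP-ATTR-NAME{i}"] = name
        headers[f"X-GP-ATTR-TYPECODE{i}"] = str(typecode)
        headers[f"X-GP-ATTR-TYPENAME{i}"] = typename
    return headers


def get_fragments(path, headers):
    """Call Fragmenter/getFragments and return the decoded JSON body."""
    query = urllib.parse.urlencode({"path": path}, safe="/")
    request = urllib.request.Request(
        f"{PXF_BASE}/Fragmenter/getFragments?{query}", headers=headers
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a PXF service running, {{get_fragments("/apps/hive/warehouse/kdtest", build_pxf_headers([("key", 25, "text"), ("value", 25, "text")], "default.kdtest"))}} should issue roughly the same request as the curl call.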

> Document HAWQ to PXF APIs
> -------------------------
>
>                 Key: HAWQ-1234
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1234
>             Project: Apache HAWQ
>          Issue Type: Sub-task
>          Components: PXF
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>         Attachments: PXFAdvancedStatsplan.pdf
>
>
> It would be very useful to start documenting HAWQ to PXF APIs. The right
> places to start are:
>    * libchurl (a thin wrapper that lets HAWQ C code make REST calls):
>      https://github.com/apache/incubator-hawq/blob/master/src/include/access/libchurl.h
>      https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/libchurl.c
>    * pxfmasterapi (mostly metadata calls that the master makes):
>      https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/pxfmasterapi.c
>      Here you will find how HAWQ uses PXF's REST API to pull external
>      metadata, along with some logic to parse the JSON response.
>    * gpbridgeapi (segment calls to PXF):
>      https://github.com/apache/incubator-hawq/blob/master/src/bin/gpfusion/gpbridgeapi.c
>      Here you will find other examples (read and write calls) used to fetch
>      external data.
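
As an illustration of the JSON parsing mentioned for pxfmasterapi, here is a minimal Python sketch (not the C logic in pxfmasterapi.c) that reads the two responses shown in the comment. The function names are hypothetical. The {{userData}} sample is only the leading part of the value from the getFragments response, and the {{!HUDD!}} separator is an observation from decoding that sample, not a documented format.

```python
import base64
import json

# Response body copied from the getMetadata example in the comment.
SAMPLE_METADATA = (
    '{"PXFMetadata":[{"item":{"path":"default","name":"kdtest"},'
    '"fields":[{"name":"key","type":"text"},{"name":"value","type":"text"}]}]}'
)

# Leading part only of the "userData" value from the getFragments example.
USER_DATA_PREFIX = (
    "b3JnLmFwYWNoZS5oYWRvb3AubWFwcmVkLlRleHRJbnB1dEZvcm1hdCFIVUREIW9yZy5hcGFjaGUu"
    "aGFkb29wLmhpdmUuc2VyZGUyLmxhenkuTGF6eVNpbXBsZVNlckRlIUhVREQh"
)


def parse_pxf_metadata(body):
    """Map "path.name" to a list of (column name, column type) pairs."""
    tables = {}
    for entry in json.loads(body)["PXFMetadata"]:
        item = entry["item"]
        key = f'{item["path"]}.{item["name"]}'
        tables[key] = [(f["name"], f["type"]) for f in entry["fields"]]
    return tables


def split_user_data(encoded):
    """Base64-decode a fragment's userData and split it on the observed separator."""
    text = base64.b64decode(encoded).decode("utf-8")
    return text.split("!HUDD!")


if __name__ == "__main__":
    print(parse_pxf_metadata(SAMPLE_METADATA))
    print(split_user_data(USER_DATA_PREFIX)[:2])
```

The first two pieces of the decoded sample match the InputFormat and SerDe Library values in the {{describe formatted}} output, which suggests the fragmenter passes Hive storage details along in this field.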
> Design doc on PXF's support for analyze (pxf's analyzer) is attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)