[ 
https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030906#comment-16030906
 ] 

Jacek Dobrowolski commented on HAWQ-56:
---------------------------------------

It gives non deterministic results on profile PROFILE=HdfsTextSimple, as it 
parallely reads each file block.

However when using profile PROFILE=HdfsTextMulti in pxf table definition, table 
is read serially and only first file line is stripped - and that works 
correctly.

Please reopen this and allow using HEADER with profile PROFILE=HdfsTextMulti.

> Non-deterministic header results with "HEADER" option from external table
> -------------------------------------------------------------------------
>
>                 Key: HAWQ-56
>                 URL: https://issues.apache.org/jira/browse/HAWQ-56
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: PXF
>            Reporter: Goden Yao
>            Assignee: Noa Horn
>            Priority: Critical
>             Fix For: 2.0.0.0-incubating
>
>
> *Repro Steps*
> External table definition
> {code:sql}
> drop external table if exists testtbl;
> create external table testtbl(i text, j text)
> location 
> ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
> format 'TEXT' (delimiter ',' header);
> select * from testtbl;
> {code}
> example with 4 segment servers and 4 test files with headers in hdfs
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i | j
> ---+---
>  3 | c
>  2 | b
>  1 | a
>  4 | d
> (4 rows)
> {code}
> With 5 test files
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i  |   j
> ----+-------
>  5  | e
>  2  | b
>  ID | Value
>  3  | c
>  1  | a
>  ID | VALUE
>  4  | d
> (7 rows)
> {code}
> *Analysis*
> When using HEADER option, header line is removed only once per segment.
> As a result there will be different results depending on the number of 
> segments/fragments are scanned. If the number of files is greater than the 
> number of segments, the header row is included in the data for some files.
> If the number of files is less than or equal to the number of segments, the 
> data retrieved is good. Thus, non-deterministic.
> The reason for this behavior is that header line handling is done by the 
> external protocol code (fileam.c) which checks if the header_line flag is 
> true, and if so skips the first line and sets the flag to false. This code 
> calls the custom protocol code (in our case pxf) to get the next tuples, and 
> so doesn't know if the tuples are from the same resource or not.
> From what I can see, in gpfdist protocol the problem is solved by letting the 
> custom protocol code handle this and marking the flag as false for the 
> external protocol infrastructure in fileam.c.
> *Proposed Solution*
> Add a check in pxf validator function (a function that is being called as 
> part of external table creation).
> This check will error out if HEADER option is used in a PXF table.
> Currently the validation function API only allows access to the table's URLs 
> (in the LOCATION part of the table), and not the format options. In order to 
> add the check an API change in the external protocol is required.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to