[ https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Goden Yao updated HAWQ-56: -------------------------- Description: **Repro Steps** External table definition {code:sql} drop external table if exists testtbl; create external table testtbl(i text, j text) location ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver') format 'TEXT' (delimiter ',' header); select * from testtbl; {code} example with 4 segment servers and 4 test files with headers in hdfs {code:sql} gpadmin=# select * from testtbl ; i | j ---+--- 3 | c 2 | b 1 | a 4 | d (4 rows) {code} With 5 test files {code:sql} gpadmin=# select * from testtbl ; i | j ----+------- 5 | e 2 | b ID | Value 3 | c 1 | a ID | VALUE 4 | d (7 rows) {code} **Analysis** When using HEADER option, header line is removed only once per segment. As a result there will be different results depending on the number of segments/fragments are scanned. If the number of files is greater than the number of segments, the header row is included in the data for some files. If the number of files is less than or equal to the number of segments, the data retrieved is good. Thus, non-deterministic. The reason for this behavior is that header line handling is done by the external protocol code (fileam.c) which checks if the header_line flag is true, and if so skips the first line and sets the flag to false. This code calls the custom protocol code (in our case pxf) to get the next tuples, and so doesn't know if the tuples are from the same resource or not. >From what I can see, in gpfdist protocol the problem is solved by letting the >custom protocol code handle this and marking the flag as false for the >external protocol infrastructure in fileam.c. **Proposed Solution** Add a check in pxf validator function (a function that is being called as part of external table creation). This check will error out if HEADER option is used in a PXF table. Currently the validation function API only allows access to the table's URLs (in the LOCATION part of the table), and not the format options. In order to add the check an API change in the external protocol is required. was: **Repro Steps** External table definition {code:sql} drop external table if exists testtbl; create external table testtbl(i text, j text) location ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver') format 'TEXT' (delimiter ',' header); select * from testtbl; {code} example with 4 segment servers and 4 test files with headers in hdfs {code:sql} gpadmin=# select * from testtbl ; i | j ---+--- 3 | c 2 | b 1 | a 4 | d (4 rows) {code} With 5 test files {code:sql} gpadmin=# select * from testtbl ; i | j ----+------- 5 | e 2 | b ID | Value 3 | c 1 | a ID | VALUE 4 | d (7 rows) {code} **Analysis** When using HEADER option, header line is removed only once per segment. As a result there will be different results depending on the number of segments/fragments are scanned. The reason for this behavior is that header line handling is done by the external protocol code (fileam.c) which checks if the header_line flag is true, and if so skips the first line and sets the flag to false. This code calls the custom protocol code (in our case pxf) to get the next tuples, and so doesn't know if the tuples are from the same resource or not. >From what I can see, in gpfdist protocol the problem is solved by letting the >custom protocol code handle this and marking the flag as false for the >external protocol infrastructure in fileam.c. **Proposed Solution** Add a check in pxf validator function (a function that is being called as part of external table creation). This check will error out if HEADER option is used in a PXF table. Currently the validation function API only allows access to the table's URLs (in the LOCATION part of the table), and not the format options. In order to add the check an API change in the external protocol is required. > Non-deterministic header results with "HEADER" option from external table > ------------------------------------------------------------------------- > > Key: HAWQ-56 > URL: https://issues.apache.org/jira/browse/HAWQ-56 > Project: Apache HAWQ > Issue Type: Bug > Components: External Tables & PXF > Reporter: Goden Yao > Priority: Critical > > **Repro Steps** > External table definition > {code:sql} > drop external table if exists testtbl; > create external table testtbl(i text, j text) > location > ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver') > format 'TEXT' (delimiter ',' header); > select * from testtbl; > {code} > example with 4 segment servers and 4 test files with headers in hdfs > {code:sql} > gpadmin=# select * from testtbl ; > i | j > ---+--- > 3 | c > 2 | b > 1 | a > 4 | d > (4 rows) > {code} > With 5 test files > {code:sql} > gpadmin=# select * from testtbl ; > i | j > ----+------- > 5 | e > 2 | b > ID | Value > 3 | c > 1 | a > ID | VALUE > 4 | d > (7 rows) > {code} > **Analysis** > When using HEADER option, header line is removed only once per segment. > As a result there will be different results depending on the number of > segments/fragments are scanned. If the number of files is greater than the > number of segments, the header row is included in the data for some files. > If the number of files is less than or equal to the number of segments, the > data retrieved is good. Thus, non-deterministic. > The reason for this behavior is that header line handling is done by the > external protocol code (fileam.c) which checks if the header_line flag is > true, and if so skips the first line and sets the flag to false. This code > calls the custom protocol code (in our case pxf) to get the next tuples, and > so doesn't know if the tuples are from the same resource or not. > From what I can see, in gpfdist protocol the problem is solved by letting the > custom protocol code handle this and marking the flag as false for the > external protocol infrastructure in fileam.c. > **Proposed Solution** > Add a check in pxf validator function (a function that is being called as > part of external table creation). > This check will error out if HEADER option is used in a PXF table. > Currently the validation function API only allows access to the table's URLs > (in the LOCATION part of the table), and not the format options. In order to > add the check an API change in the external protocol is required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)