[
https://issues.apache.org/jira/browse/HAWQ-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071381#comment-15071381
]
Ruilong Huo edited comment on HAWQ-280 at 12/25/15 6:21 AM:
------------------------------------------------------------
Root cause analysis shows that the FIRST_1000_BAD rule dominates the
REJECT_LIMIT_REACHED rule when checking rejected rows during external table
access: once the first 1000 rows in a segment are all rejected, the operation
aborts regardless of the configured SEGMENT REJECT LIMIT. The resolution is to
remove the FIRST_1000_BAD rule so that only the declared reject limit applies.
The changed behaviour is shown below:
1. Accessing external table with bad rows
{noformat}
Step 1: start gpfdist service
gpfdist -d /home/gpadmin/data/ -p 8081 -l /home/gpadmin/log/load.log &
------------------------------------------------------------------------------------------------
[1] 34635
Serving HTTP on port 8081, directory /home/gpadmin/data
Step 2: create external table
CREATE EXTERNAL TABLE test_ext (id INT, a TEXT, b TEXT, c TEXT, z TEXT)
LOCATION ('gpfdist://localhost:8081/test.csv')
FORMAT 'CSV'
LOG ERRORS INTO test_ext_err SEGMENT REJECT LIMIT 3000 ROWS;
-----------------------------------------------------------------------------------------------------
NOTICE: Error table "test_ext_err" does not exist. Auto generating an error
table with the same name
CREATE EXTERNAL TABLE
Step 3: access external table
SELECT COUNT(*) FROM test_ext;
-------------------------------------------------
NOTICE: Found 2000 data formatting errors (2000 or more input rows). Rejected
related input data.
count
-------
0
(1 row)
{noformat}
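Since the rejected rows are now logged instead of aborting the scan, the error
table named in LOG ERRORS INTO can be inspected afterwards. The queries below
are illustrative, not part of the original reproduction; the column names
follow the error-table layout quoted in the HINT further down.
{noformat}
-- Illustrative follow-up: look at a few of the rejected rows that were logged
-- instead of aborting the query.
SELECT relname, filename, linenum, errmsg
FROM test_ext_err
ORDER BY linenum
LIMIT 5;

-- With all 2000 input rows malformed, the error table is expected to hold
-- one entry per rejected row.
SELECT COUNT(*) FROM test_ext_err;
{noformat}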
2. Copying from file with bad rows
{noformat}
Step 1: create table
CREATE TABLE test_copy (id INT, a TEXT, b TEXT, c TEXT, z TEXT);
------------------------------------------------------------------------------------------------
CREATE TABLE
Step 2: copy data in file to table in database
COPY test_copy FROM '/home/gpadmin/data/test.csv' LOG ERRORS INTO test_copy_err
SEGMENT REJECT LIMIT 3000 ROWS;
---------------------------------------------------------------------------------------------------------------
NOTICE: Error table "test_copy_err" does not exist. Auto generating an error
table with the same name
WARNING: The error table was created in the same transaction as this
operation. It will get dropped if transaction rolls back even if bad rows are
present
HINT: To avoid this create the error table ahead of time using: CREATE TABLE
<name> (cmdtime timestamp with time zone, relname text, filename text, linenum
integer, bytenum integer, errmsg text, rawdata text, rawbytes bytea)
NOTICE: Found 2000 data formatting errors (2000 or more input rows). Errors
logged into error table "test_copy_err"
COPY 0
{noformat}
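To avoid the warning about the error table being dropped on rollback, the error
table can be created ahead of time exactly as the HINT suggests. A minimal
sketch, reusing the column list from the HINT and the COPY command from Step 2:
{noformat}
-- Optional variant (illustrative): pre-create the error table with the layout
-- given in the HINT, so it survives a rollback of the COPY transaction.
CREATE TABLE test_copy_err (
    cmdtime  timestamp with time zone,
    relname  text,
    filename text,
    linenum  integer,
    bytenum  integer,
    errmsg   text,
    rawdata  text,
    rawbytes bytea
);

COPY test_copy FROM '/home/gpadmin/data/test.csv'
LOG ERRORS INTO test_copy_err SEGMENT REJECT LIMIT 3000 ROWS;
{noformat}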
was (Author: huor):
Root cause analysis shows that the FIRST_1000_BAD rule dominates the
REJECT_LIMIT_REACHED rule when checking rejected rows during external table
access. The resolution is to remove the FIRST_1000_BAD rule.
> Error accessing external table or copying from file with bad rows
> -----------------------------------------------------------------
>
> Key: HAWQ-280
> URL: https://issues.apache.org/jira/browse/HAWQ-280
> Project: Apache HAWQ
> Issue Type: Bug
> Components: External Tables
> Affects Versions: 2.0.0-beta-incubating
> Reporter: Ruilong Huo
> Assignee: Ruilong Huo
> Attachments: test.csv
>
>
> It errors out without returning a result when accessing an external table or
> copying from a file with bad rows.
> 1. Error accessing external table with bad rows
> {noformat}
> Step 1: download the attached test.csv with 2000 rows, all of which are badly formatted
> Step 2: start gpfdist service
> gpfdist -d /home/gpadmin/data/ -p 8081 -l /home/gpadmin/log/load.log &
> ------------------------------------------------------------------------------------------------
> [1] 34635
> Serving HTTP on port 8081, directory /home/gpadmin/data
> Step 3: create external table
> CREATE EXTERNAL TABLE test_ext (id INT, a TEXT, b TEXT, c TEXT, z TEXT)
> LOCATION ('gpfdist://localhost:8081/test.csv')
> FORMAT 'CSV'
> LOG ERRORS INTO test_ext_err SEGMENT REJECT LIMIT 3000 ROWS;
> -----------------------------------------------------------------------------------------------------
> NOTICE: Error table "test_ext_err" does not exist. Auto generating an error
> table with the same name
> CREATE EXTERNAL TABLE
> Step 4: access external table
> SELECT COUNT(*) FROM test_ext;
> -------------------------------------------------
> ERROR: All 1000 first rows in this segment were rejected. Aborting operation
> regardless of REJECT LIMIT value. Last error was: missing data for column "z"
> (seg0 localhost:40000 pid=35647)
> DETAIL: External table test_ext, line 1000 of
> gpfdist://localhost:8081/test.csv: "29,aaa,bbb,zzz"
> {noformat}
> 2. Error copying from file with bad rows
> {noformat}
> Step 1: download the attached test.csv with 2000 rows, all of which are badly formatted
> Step 2: create table
> CREATE TABLE test_copy (id INT, a TEXT, b TEXT, c TEXT, z TEXT);
> ------------------------------------------------------------------------------------------------
> CREATE TABLE
> Step 3: copy data in file to table in database
> COPY test_copy FROM '/home/gpadmin/data/test.csv' LOG ERRORS INTO
> test_copy_err SEGMENT REJECT LIMIT 3000 ROWS;
> --------------------------------------------------------------------------------------------------------
> NOTICE: Error table "test_copy_err" does not exist. Auto generating an error
> table with the same name
> WARNING: The error table was created in the same transaction as this
> operation. It will get dropped if transaction rolls back even if bad rows are
> present
> HINT: To avoid this create the error table ahead of time using: CREATE TABLE
> <name> (cmdtime timestamp with time zone, relname text, filename text,
> linenum integer, bytenum integer, errmsg text, rawdata text, rawbytes bytea)
> ERROR: All 1000 first rows in this segment were rejected. Aborting operation
> regardless of REJECT LIMIT value. Last error was: missing data for column "a"
> CONTEXT: COPY test_copy, line 1000: "29,aaa,bbb,zzz"
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)