[jira] [Updated] (DRILL-5914) CSV (text) reader fails to parse quoted newlines in trailing fields

Paul Rogers (JIRA) Mon, 30 Oct 2017 18:09:07 -0700

     [ https://issues.apache.org/jira/browse/DRILL-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Paul Rogers updated DRILL-5914:
-------------------------------
    Description: 
Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test. The 
input file is as follows:

{noformat}
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
{noformat}

Note the newline inside the description in the last record.

If we do a `SELECT *` query, the file is parsed fine; we get 4 records.

If we do a `SELECT Year, Model` query, the CSV reader uses a special trick: it 
short-circuits reads on the three columns that are not wanted:

{code}
TextReader.parseRecord() {
...
        if (earlyTerm) {
          if (ch != newLine) {
            input.skipLines(1); // <-- skip lines
          }
          break;
        }
{code}

The `skipLines()` method skips forward in the file, discarding characters until 
it hits a newline:

{code}
      do {
        nextChar();
      } while (lineCount < expectedLineCount);
{code}

Note that this code handles individual characters; it is not aware of per-field 
semantics. That is, unlike the higher-level parser methods, the `nextChar()` 
method does not treat newlines inside quoted fields as special.
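
As an illustration only, the difference between a character-level skip and a quote-aware skip can be sketched as follows. This is not Drill's code: the class `QuoteAwareSkip` and the methods `skipLineNaive()` and `skipRecord()` are hypothetical names, and the sketch assumes the skip starts at a field boundary, that `"` is the quote character, and that an embedded quote is escaped by doubling it.

```java
// Hypothetical sketch; not Drill's TextInput/TextReader implementation.
final class QuoteAwareSkip {

  /** Naive skip: returns the index just past the next '\n', ignoring quotes. */
  static int skipLineNaive(String input, int pos) {
    int nl = input.indexOf('\n', pos);
    return nl < 0 ? input.length() : nl + 1;
  }

  /**
   * Quote-aware skip: a '\n' inside a quoted field does not end the record.
   * Assumes pos is at a field boundary (that is, outside any quoted field).
   */
  static int skipRecord(String input, int pos) {
    boolean inQuotes = false;
    for (int i = pos; i < input.length(); i++) {
      char ch = input.charAt(i);
      if (ch == '"') {
        // A doubled quote ("") toggles twice, so the state is unchanged.
        inQuotes = !inQuotes;
      } else if (ch == '\n' && !inQuotes) {
        return i + 1;
      }
    }
    return input.length();
  }

  public static void main(String[] args) {
    String record =
        "1996,Jeep,Grand Cherokee,\"MUST SELL!\nair, moon roof, loaded\",4799.00\n";
    // Position just after the Model field, where the remaining columns are skipped.
    int start = "1996,Jeep,Grand Cherokee,".length();

    int naiveEnd = skipLineNaive(record, start);
    int awareEnd = skipRecord(record, start);

    // The naive skip stops inside the quoted Description field.
    System.out.println("naive skip leaves unread: "
        + record.substring(naiveEnd).trim());
    // The quote-aware skip consumes the whole record.
    System.out.println("quote-aware skip consumed record: "
        + (awareEnd == record.length()));
  }
}
```

With the naive skip, the text `air, moon roof, loaded",4799.00` is left in the input and would be read as the start of a new record.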

This problem shows up acutely in a `SELECT COUNT(*)` style query that skips 
all fields; as a result, we count the input as five records rather than four.
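
The five-versus-four discrepancy can be reproduced outside Drill with a small, self-contained sketch. The class `CsvRecordCount` and its methods are hypothetical names used only for illustration; the sketch assumes the sample file ends with a trailing newline and that the header line is excluded from the count.

```java
// Hypothetical sketch that reproduces the miscount on the sample input.
final class CsvRecordCount {

  static final String SAMPLE =
      "Year,Make,Model,Description,Price\n"
      + "1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
      + "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00\n"
      + "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00\n"
      + "1996,Jeep,Grand Cherokee,\"MUST SELL!\n"
      + "air, moon roof, loaded\",4799.00\n";

  /** Counts every '\n', as a character-level line skipper would. */
  static int countNewlines(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) == '\n') {
        count++;
      }
    }
    return count;
  }

  /** Counts only the '\n' characters that fall outside quoted fields. */
  static int countRecordEnds(String s) {
    int count = 0;
    boolean inQuotes = false;
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      if (ch == '"') {
        inQuotes = !inQuotes; // a doubled quote toggles twice: no net change
      } else if (ch == '\n' && !inQuotes) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Subtract one in each case to exclude the header line.
    System.out.println("naive count:       " + (countNewlines(SAMPLE) - 1));
    System.out.println("quote-aware count: " + (countRecordEnds(SAMPLE) - 1));
  }
}
```

On the sample input the naive count is 5 and the quote-aware count is 4, matching the behavior described above.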


> CSV (text) reader fails to parse quoted newlines in trailing fields
> -------------------------------------------------------------------
>
>                 Key: DRILL-5914
>                 URL: https://issues.apache.org/jira/browse/DRILL-5914
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
