[
https://issues.apache.org/jira/browse/DRILL-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers resolved DRILL-5914.
--------------------------------
Resolution: Fixed
This issue was fixed as part of the "Compliant text reader V3" project. The
test cited in the description now correctly reports 4 lines for the
{{COUNT(*)}} query.
> CSV (text) reader fails to parse quoted newlines in trailing fields
> -------------------------------------------------------------------
>
> Key: DRILL-5914
> URL: https://issues.apache.org/jira/browse/DRILL-5914
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test.
> The input file is as follows:
> {noformat}
> Year,Make,Model,Description,Price
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 1999,Chevy,"Venture ""Extended Edition""","",4900.00
> 1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
> 1996,Jeep,Grand Cherokee,"MUST SELL!
> air, moon roof, loaded",4799.00
> {noformat}
> Note the newline inside the description in the last record.
> If we do a `SELECT *` query, the file is parsed fine; we get 4 records.
> If we do a `SELECT Year, Model` query, the CSV reader uses a special trick:
> it short-circuits reads on the three columns that are not wanted:
> {code}
> TextReader.parseRecord() {
>   ...
>   if (earlyTerm) {
>     if (ch != newLine) {
>       input.skipLines(1); // <-- skip lines
>     }
>     break;
>   }
>   ...
> }
> {code}
> The `skipLines()` method skips forward in the file, discarding characters
> until it hits a newline:
> {code}
> do {
>   nextChar();
> } while (lineCount < expectedLineCount);
> {code}
> Note that this code handles individual characters; it is not aware of
> per-field semantics. That is, unlike the higher-level parser methods, the
> `nextChar()` method does not consider newlines inside of quoted fields to be
> special.
> This problem shows up acutely in a `SELECT COUNT(*)` style query that skips
> all fields; the result is we count the input as five lines, not four.
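
The miscount described above can be reproduced outside Drill. The following is a minimal sketch, not Drill's code: the class name, method names, and `SAMPLE` constant are hypothetical, and it assumes `\n` line endings, `"` as the quote character, and `""` as the escaped quote. It contrasts a character-level newline count, similar in spirit to the skip loop, with a count that ignores newlines inside quoted fields.

```java
// Hypothetical illustration only; none of these names come from Drill.
class QuotedNewlineCount {

    // The four data records from the issue, without the header line.
    static final String SAMPLE =
        "1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
        + "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00\n"
        + "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00\n"
        + "1996,Jeep,Grand Cherokee,\"MUST SELL!\n"
        + "air, moon roof, loaded\",4799.00\n";

    // Counts every newline character, with no notion of quoted fields.
    static int countLinesNaive(String text) {
        int lines = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lines++;
            }
        }
        return lines;
    }

    // Counts only newlines that fall outside a quoted field.
    static int countRecordsQuoteAware(String text) {
        int records = 0;
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            } else if (ch == '\n' && !inQuotes) {
                records++;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println("naive:       " + countLinesNaive(SAMPLE));
        System.out.println("quote-aware: " + countRecordsQuoteAware(SAMPLE));
    }
}
```

On the sample data the naive count is 5 and the quote-aware count is 4, matching the wrong and right results for the `COUNT(*)` query.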