[ 
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074056#comment-16074056
 ] 

Paul Rogers edited comment on DRILL-5239 at 7/4/17 8:23 PM:
------------------------------------------------------------

See [RFC-4180|https://tools.ietf.org/html/rfc4180] for the IETF standard for 
CSV files. This standard does not support headers. Drill's "compliant" text 
reader complies with RFC-4180.

As noted in [Wikipedia|https://en.wikipedia.org/wiki/Comma-separated_values], 
despite RFC-4180, CSV is not a well-defined format; it is instead often an 
informal theme with each system handling details slightly differently. Indeed, 
Drill violates RFC-4180 in that we allow Linux-style line terminators instead 
of the DOS-style terminators specified in the RFC.

Also came across [Super 
CSV|https://super-csv.github.io/super-csv/csv_specification.html] which has a 
nice summary of the rules in a more readable format than the RFC.

Someone has specified a [JSON 
format|http://specs.frictionlessdata.io/csv-dialect/] to describe CSV file 
formats.
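
For illustration, a minimal dialect descriptor in that JSON format might look 
like the following (the property names are my reading of the spec; treat this 
as a sketch, not a validated example):

```json
{
  "dialect": {
    "delimiter": ",",
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "doubleQuote": true,
    "header": true,
    "commentChar": "#"
  }
}
```

Notice that the spec has to enumerate exactly the kinds of options discussed 
below: line terminators, quoting, headers and comments.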

So we can see that, in normal practice, comments are not part of the RFC-4180 
format; your suggestion to have them turned off by default is therefore a good 
one.

Let's dig into comments a bit. Here is our [first 
hint|https://stackoverflow.com/questions/1961006/can-a-csv-file-have-a-comment],
 third answer down:

{quote}
In engineering data, it is common to see the # symbol in the first column used 
to signal a comment.

I use the [ostermiller CSV parsing 
library|http://ostermiller.org/utils/CSV.html] to read and process such files. 
That library allows you to set the comment character. After the parse operation 
you get an array just containing the real data, no comments.
{quote}

The [Ostermiller Java Utilities, Comma Separated Values 
(CSV)|http://ostermiller.org/utils/CSV.html] looks to be a good source for what 
people actually do in practice; the comments in the description suggest the 
author kept tinkering with the library to handle the various file formats "in 
the wild." Of particular interest is the difference between what he calls the 
"Unix style" and "Excel style" of CSV files. I don't think our code handles 
these differences. The bottom of the page lists a number of other CSV resources.

Maybe we should just use the Ostermiller code rather than tinkering with the 
existing code? Especially since, as part of the "memory fragmentation" project, 
I've already ripped out and replaced the header parser and will need to replace 
the two implementations that write to value vectors...

We've talked about comments. Another issue is blank lines. The [Ostermiller CSV 
lexer|http://ostermiller.org/utils/src/CSVLexer.lex.html] says it ignores blank 
lines. Perhaps we should also, since a blank line can never be a valid record 
(unless the header states that the file has only one field... You gotta love 
CSV...)

That same page shows how the lexer handles comments:

{code}
shredder.setCommentStart("#;!");
{code}

In this case, a comment character starts the comment (which, this being a 
lexer, could begin anywhere in the line). If the resulting line is blank, it is 
ignored. Perhaps overkill...
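
To make that behavior concrete, here is a rough sketch in plain Java (not the 
Ostermiller code itself; the class and method names are made up for 
illustration): any configured comment character starts a comment wherever it 
appears, and lines left blank after stripping are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the Ostermiller-style rule described above.
public class CommentStripper {
    private final String commentChars;

    public CommentStripper(String commentChars) {
        this.commentChars = commentChars;
    }

    // Returns the line with any comment removed, or null if nothing
    // (other than whitespace) remains -- such lines are ignored.
    public String strip(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (commentChars.indexOf(line.charAt(i)) >= 0) {
                line = line.substring(0, i);
                break;
            }
        }
        return line.trim().isEmpty() ? null : line;
    }

    // Keeps only the lines that still carry data after stripping.
    public List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String stripped = strip(line);
            if (stripped != null) {
                kept.add(stripped);
            }
        }
        return kept;
    }
}
```

Note that this also folds in blank-line skipping: a line that was blank to 
begin with strips down to nothing and is discarded along with full-line 
comments.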

A [very complete "informal" 
specification|https://www.csvreader.com/csv_format.php] of CSV-files 
in-the-wild says that comments, if they appear, should be the first character 
on the line. (Lots of good stuff in this link!) Here is the golden nugget (and 
where I saw the comment character on a previous project): "An IIS web log file 
is a good example." A Google search found [an 
example|http://surfray.com/?id=blog&news=38133] similar to the one I made up in 
a previous post.

{code}
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2009-06-11 05:12:03
#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status
2009-06-11 05:12:02 W3SVC1893743816 192.168.1.109 GET / – 4677 – 192.168.1.109 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+5.0+Robot) 401 2 2148074254
2009-06-11 05:12:02 W3SVC1893743816 192.168.1.109 GET / – 4677 – 192.168.1.109 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+5.0+Robot) 401 2 2148074254
{code}

Note, however, that the above format is not really CSV. Field names are in a 
comment, not a CSV header. Values are space-delimited (not comma-delimited), 
though I've also seen tab-delimited variations. To read the above properly 
requires a specialized parser. But, still, one can get close with a SSV (space 
separated value) parser that ignores comments. Drill would read all fields into 
one big array, which is awful because different servers can be configured with 
different fields. We'll leave that issue for another time.

Drill's CSV parser is based on the [uniVocity CSV 
parser|http://docs.univocity.com/parsers/2.4.1/com/univocity/parsers/csv/CsvParser.html],
 now maintained on [GitHub|https://github.com/uniVocity/univocity-parsers]. A 
good question is whether we can just use the jar available from the GitHub site 
rather than copying the code as we have done.
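
If we did adopt the library directly, the options discussed here map onto its 
published settings API roughly like this (a configuration sketch from my 
reading of the uniVocity docs, not tested against Drill):

{code}
CsvParserSettings settings = new CsvParserSettings();
settings.getFormat().setComment('#');       // comment character
settings.setSkipEmptyLines(true);           // ignore blank lines
settings.setHeaderExtractionEnabled(true);  // read (and skip) the header row
CsvParser parser = new CsvParser(settings);
{code}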

Go down to the 
[Examples|https://github.com/uniVocity/univocity-parsers#reading-csv] section. 
Voila! There is a CSV file in all its glory with blank lines and comments!

So, let us set our goal to read this [example 
file|http://github.com/uniVocity/univocity-parsers/tree/master/src/test/resources/examples/example.csv].
 We need a set of options that allow us to do so. This means:

* File has headers (true/*false*)
* Read headers (true/*false*)
* Skip blank lines (*true*/false)
* Comment character (any single character, blank by default meaning no comment)
* Unix extensions (true/*false*)

If comments are enabled, a line may have leading whitespace before the comment 
character. And, in a file with comments, a value can begin with the comment 
character if it is quoted (or escaped):

{code}
# A file with comment
col1, col2, col3
# Next line is a comment
#fred,10,bedrock
# So is this next one
    #fred,10,bedrock
# Next line is not a comment (Excel style)
"#fred",10,bedrock
# Next line is not a comment (Unix style)
\#fred,10,bedrock
# Not a comment, leading space in value (i.e. "  #fred"), Unix style
   \#fred,10,bedrock
{code}
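
Here is a sketch of how a reader might classify the lines above, assuming the 
option set just listed (comment character {{#}}, blank-line skipping on). The 
names are invented for illustration. The quoted and backslash-escaped forms 
fall through as data precisely because the comment character is not the first 
non-whitespace character:

```java
// Illustrative line classifier for the rules shown in the example above.
public class LineClassifier {
    public enum Kind { BLANK, COMMENT, DATA }

    private final char commentChar;

    public LineClassifier(char commentChar) {
        this.commentChar = commentChar;
    }

    public Kind classify(String line) {
        int i = 0;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        if (i == line.length()) {
            return Kind.BLANK; // blank lines are skipped
        }
        // A line is a comment only if the comment character is the first
        // non-whitespace character. "\#fred" (Unix escape) and "\"#fred\""
        // (Excel quoting) start with another character, so they are data;
        // the field parser later removes the escape or the quotes.
        return line.charAt(i) == commentChar ? Kind.COMMENT : Kind.DATA;
    }
}
```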

The lesson in all this is that CSV files have a long history and exist in the 
field with many variations. Mature parsers have learned to handle these 
variations. Because Drill must handle data as it actually exists, not as we'd 
like it to be, Drill would be well served to learn from those who came before 
us and to correctly support the variations encoded in these various CSV 
libraries. Fortunately for us, if we just use the (newer? complete?) uniVocity 
libraries, we get that for free.



> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from 
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a Parquet file



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
