[jira] [Comment Edited] (DRILL-5239) Drill text reader reports wrong results when column value starts with '#'

Paul Rogers (JIRA) Tue, 04 Jul 2017 18:57:13 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074163#comment-16074163
 ]


Paul Rogers edited comment on DRILL-5239 at 7/5/17 1:56 AM:
------------------------------------------------------------

See {{TextFormatPlugin.TextFormatConfig}} for the current defaults. We probably 
cannot change these without risking breakage in existing apps:

{code}
    public String lineDelimiter = "\n";
    public char fieldDelimiter = '\n';
    public char quote = '"';
    public char escape = '"';
    public char comment = '#';
    public boolean skipFirstLine = false;
    public boolean extractHeader = false;
{code}

Notice that comments are enabled because {{comment}} is set. To disable them, 
presumably set the {{comment}} property to empty. That is the quick workaround 
for the person who filed this bug.
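
A rough sketch of the behavior, using the data set from the bug report. This is 
not Drill code: the class and method names are made up for illustration, and 
modeling "comments disabled" as a null comment character is an assumption. How 
Drill actually accepts an empty {{comment}} setting still needs to be checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Drill text reader.
class CommentSkipSketch {
    // Drops every line whose first character equals the comment character.
    // A null comment character models "comments disabled" (an assumption).
    static List<String> keepDataLines(List<String> lines, Character comment) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            boolean isComment = comment != null
                    && !line.isEmpty()
                    && line.charAt(0) == comment;
            if (!isComment) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> data =
                Arrays.asList("D|32", "8h|234", ";#|3489", "^$*(|308", "#|98");
        // With '#' as the comment character the last row is lost.
        System.out.println(keepDataLines(data, '#').size());  // 4
        // With comments disabled all five rows survive.
        System.out.println(keepDataLines(data, null).size()); // 5
    }
}
```

Note that {{;#|3489}} survives in both cases because the '#' is not the first 
character of the line.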

The {{lineDelimiter}} is unnecessary. Reliable code would simply skip all 
combinations of [\n\r]+. Doing so effectively skips blank lines, which is fine 
most of the time. (It can be ambiguous if the file has only one column, and the 
value of that column is an empty string. Easy work-around: quote the empty 
string.)
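
A minimal sketch of that idea, assuming plain Java string handling rather than 
Drill's reader. The class name is hypothetical, and the sketch deliberately 
ignores quoted fields that contain newlines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; ignores newlines inside quoted fields.
class LineSplitSketch {
    // Treats any run of \n and \r characters as a single line break,
    // which also drops blank lines.
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("[\\n\\r]+")) {
            if (!line.isEmpty()) { // split() can return a leading empty string
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mixed Unix, Windows and old Mac line endings, plus a blank line.
        System.out.println(splitLines("a\r\nb\n\nc\r")); // [a, b, c]
        // One-column file: the unquoted empty value disappears ...
        System.out.println(splitLines("x\n\ny").size());       // 2
        // ... but the quoted empty value is kept as its own line.
        System.out.println(splitLines("x\n\"\"\ny").size());   // 3
    }
}
```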

The {{fieldDelimiter}} default is completely wrong. One cannot use the same 
character for both the field and line delimiter; yet we do. That is a bug all 
by itself, as it means the defaults support either one-field records or 
records with an unbounded number of fields (depending on whether we parse the 
line or field delimiter first). Strangely, the setter for the field delimiter 
is deprecated:

{code}
    @Deprecated
    @JsonProperty("delimiter")
    public void setFieldDelimiter(char delimiter) {
{code}
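
Back to the delimiter clash: a rough sketch of the two outcomes, assuming a 
naive split-based parser. The class and method names are hypothetical and this 
is not how Drill's reader is structured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Drill text reader.
class DelimiterClashSketch {
    // Split into lines first, then split each line into fields.
    static List<List<String>> lineFirst(String text, String lineDelim, char fieldDelim) {
        List<List<String>> records = new ArrayList<>();
        String fieldRegex = Pattern.quote(String.valueOf(fieldDelim));
        for (String line : text.split(Pattern.quote(lineDelim))) {
            records.add(Arrays.asList(line.split(fieldRegex)));
        }
        return records;
    }

    // Split on the field delimiter first: the whole input is one record.
    static List<String> fieldFirst(String text, char fieldDelim) {
        return Arrays.asList(text.split(Pattern.quote(String.valueOf(fieldDelim))));
    }

    public static void main(String[] args) {
        String text = "a,b\nc,d\n";
        // Defaults: both delimiters are '\n'.
        List<List<String>> records = lineFirst(text, "\n", '\n');
        System.out.println(records.size());        // 2 records ...
        System.out.println(records.get(0).size()); // ... of 1 field each
        // Field delimiter first: one record whose fields are the lines.
        System.out.println(fieldFirst(text, '\n').size()); // 2
    }
}
```

With the field-first order the single record grows by one field per input 
line, which is the "unbounded fields" case.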

The escape is bizarre: it is a quote. This means that the escape sequence is 
the same as the quote sequence:

{code}
id,value
1,"I'm ""quoted"""
2,"So am "
I, but on a new line."
3,"""Another quote"", he says"
4,Is ""this"" a quote", he wonders?
{code}

Probably need to look at the code to figure this out...
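
In the meantime, here is how a conventional doubled-quote parser (in the style 
of RFC 4180) would read such input. This is a sketch under that assumption, 
with a hypothetical class name; whether Drill's reader behaves the same way is 
exactly what needs checking. The sketch hard-codes ',' and '\n' as delimiters 
and '"' as both quote and escape.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Drill text reader.
class QuoteEscapeSketch {
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);   // delimiters and newlines are literal here
                } else if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote: one literal quote
                    i++;
                } else {
                    inQuotes = false;  // lone quote closes the field
                }
            } else if (c == '"' && field.length() == 0) {
                inQuotes = true;       // quote at field start opens a quoted field
            } else if (c == ',' || c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                if (c == '\n') {       // end of record
                    records.add(record);
                    record = new ArrayList<>();
                }
            } else {
                field.append(c);       // ordinary character (or stray quote)
            }
        }
        if (field.length() > 0 || !record.isEmpty()) { // no trailing newline
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "1,\"I'm \"\"quoted\"\"\"\n3,\"\"\"Another quote\"\", he says\"\n";
        for (List<String> r : parse(text)) {
            System.out.println(r.get(1));
        }
        // I'm "quoted"
        // "Another quote", he says
    }
}
```

Under these rules a quote that appears in the middle of an unquoted field is 
kept as-is, which is one possible reading of the last sample row; other parsers 
reject such input.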

Does {{skipFirstLine}} really mean "skip the first non-comment line"? If not, 
what happens here:

{code}
# I'm a comment
these,are,headers
10,20,30
{code}

If we really skip just the first line, we read the headers as data. Not a good 
thing. Again, need to check the code; maybe the property is badly named.
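
The two readings can be sketched as follows. The class and method names are 
hypothetical; which of the two (if either) matches Drill's behavior is the open 
question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Drill text reader.
class SkipFirstLineSketch {
    // Removes lines that start with the comment character.
    static List<String> dropComments(List<String> lines, char comment) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.charAt(0) != comment) {
                out.add(line);
            }
        }
        return out;
    }

    // Reading 1: skip the physical first line, then drop comments.
    static List<String> skipPhysicalFirstLine(List<String> lines, char comment) {
        List<String> rest = lines.isEmpty() ? lines : lines.subList(1, lines.size());
        return dropComments(rest, comment);
    }

    // Reading 2: drop comments, then skip the first remaining line.
    static List<String> skipFirstNonCommentLine(List<String> lines, char comment) {
        List<String> data = dropComments(lines, comment);
        return data.isEmpty() ? data : new ArrayList<>(data.subList(1, data.size()));
    }

    public static void main(String[] args) {
        List<String> file =
                Arrays.asList("# I'm a comment", "these,are,headers", "10,20,30");
        // Reading 1 returns the header line as data.
        System.out.println(skipPhysicalFirstLine(file, '#'));   // [these,are,headers, 10,20,30]
        // Reading 2 returns only the data row.
        System.out.println(skipFirstNonCommentLine(file, '#')); // [10,20,30]
    }
}
```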


> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from 
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a Parquet file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
