[jira] [Commented] (DRILL-5239) Drill text reader reports wrong results when column value starts with '#'

Paul Rogers (JIRA) Wed, 05 Jul 2017 11:27:26 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075198#comment-16075198
 ]


Paul Rogers commented on DRILL-5239:
------------------------------------

We'll need to coordinate closely on this as I'm actively reworking part of the 
CSV reader. Having two projects going on at the same time will lead to very 
difficult code merges as I'm reworking the "top" levels of this reader. The 
focus of my work is, however, the classes which write fields into value 
vectors. If your work focuses on the parser, we may be able to minimize the 
code merge cost.

For item 1: simply try to set the comment character to blank to see if this 
addresses the original issue. If so, then the original issue is resolved.
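
To illustrate item 1, here is a toy sketch in Python of how a comment-character check can drop the last row of the data set from the issue description. It is not Drill's actual parser: `read_rows` and its parameters are hypothetical names, and the assumption that a blank comment character disables the check is what item 1 proposes to verify.

```python
from typing import List, Optional

def read_rows(lines: List[str],
              delimiter: str = "|",
              comment: Optional[str] = "#") -> List[List[str]]:
    """Toy line reader (hypothetical, not Drill code).

    Skips any line that starts with the comment character; a blank or
    None comment character disables the check entirely.
    """
    rows = []
    for line in lines:
        if comment and line.startswith(comment):
            continue  # whole line treated as a comment and dropped
        rows.append(line.split(delimiter))
    return rows

# Data set copied from the issue description.
DATA = ["D|32", "8h|234", ";#|3489", "^$*(|308", "#|98"]

with_default = read_rows(DATA)               # '#' treated as comment marker
without_comment = read_rows(DATA, comment="")  # comment handling disabled

print(len(with_default), len(without_comment))
print(without_comment[-1])
```

In this sketch only the row whose first character is '#' is affected; the row ';#|3489' survives because the marker is not at the start of the line, which matches the result shown in the issue.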

For items 2-5: I'd like to suggest we could really use a spec for how CSV 
support is supposed to work in Drill. What is the supported syntax? When there 
are conflicting syntax options (# as comment vs. data, say), what options are 
available to clarify the choice? My notes provided links to many specs that 
others have created; they show we don't need very many pages to specify Drill's 
CSV behavior. Then, the community can review the spec to see if everyone agrees.

From that spec, we can identify where Drill has gaps, then identify the tasks 
needed to evolve the code from what it does now to what the spec says. Note 
that if this involves changing Drill's behavior, we'll have to be careful to 
avoid breaking existing use cases: we may need options for backward 
compatibility.

Let me put this a slightly different way. The behavior we have now has evolved 
as people have tinkered with the code to add this feature or that, without an 
overall description of how CSV support is supposed to work in Drill. If 
tinkering without a spec got us into this mess, tinkering without a spec is 
unlikely to get us out. So, the next step is a (brief) spec.

We can then figure out items 2-5 in the context of our overall CSV support as 
compared with what other (mature) libraries do.

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from 
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not however happen with a parquet file



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
