[ https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075198#comment-16075198 ]
Paul Rogers commented on DRILL-5239:
------------------------------------
We'll need to coordinate closely on this as I'm actively reworking part of the
CSV reader. Having two projects going on at the same time will lead to very
difficult code merges as I'm reworking the "top" levels of this reader. The
focus of my work is, however, the classes which write fields into value
vectors. If your work focuses on the parser, we may be able to minimize the
code merge cost.
For item 1: simply try to set the comment character to blank to see if this
addresses the original issue. If so, then the original issue is resolved.
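To make item 1 concrete, here is a minimal Python sketch of the behavior in question. It is not Drill's reader: `read_rows` and its parameters are made up for illustration, and it assumes a line is treated as a comment only when the comment character is the first character of the line, which is consistent with the result reported in the issue (the `;#` row survives, the `#` row does not).

```python
from typing import List, Optional


def read_rows(lines: List[str],
              delimiter: str = "|",
              comment: Optional[str] = "#") -> List[List[str]]:
    """Split each line into fields, skipping lines that start with `comment`.

    Hypothetical helper: passing comment="" (or None) disables comment
    handling, so every line is treated as data.
    """
    rows = []
    for line in lines:
        if comment and line.startswith(comment):
            continue  # the whole line is discarded as a comment
        rows.append(line.split(delimiter))
    return rows


# The data set from the issue description.
data = ["D|32", "8h|234", ";#|3489", "^$*(|308", "#|98"]

with_comments = read_rows(data)                  # '#' starts a comment line
without_comments = read_rows(data, comment="")   # comment handling disabled

print(len(with_comments))     # 4: the "#|98" row is dropped
print(len(without_comments))  # 5: the "#|98" row is kept
```

Whether Drill's text format configuration accepts a blank comment character in the same way is exactly what item 1 asks to verify.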
For items 2-5: I'd like to suggest we could really use a spec for how CSV
support is supposed to work in Drill. What is the supported syntax? When there
are conflicting syntax options (# as comment vs. data, say), what options are
available to clarify the choice? My notes provided links to many CSV specs that
others have created; they show we don't need very many pages to specify Drill's
CSV behavior. Then, the community can review the spec to see if everyone agrees.
From that spec, we can identify where Drill has gaps, then identify the tasks
needed to evolve the code from what it does now to what the spec says. Note that
if this involves changing Drill's behavior, we'll have to be careful to avoid
breaking existing use cases: we may need options for backward compatibility.
Let me put this a slightly different way. The behavior we have now has evolved
as people have tinkered with the code to add this feature or that, without an
overall description of how CSV support is supposed to work in Drill. If
tinkering without a spec got us into this mess, tinkering without a spec is
unlikely to get us out. So, the next step is a (brief) spec.
We can then figure out items 2-5 in the context of our overall CSV support, as
compared with what other (mature) libraries do.
> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
> Key: DRILL-5239
> URL: https://issues.apache.org/jira/browse/DRILL-5239
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.10.0
> Reporter: Rahul Challapalli
> Assignee: Roman
> Priority: Blocker
> Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1 | col2 |
> +-------+-------+
> | D | 32 |
> | 8h | 234 |
> | ;# | 3489 |
> | ^$*( | 308 |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a parquet file.