[
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075143#comment-16075143
]
Roman commented on DRILL-5239:
------------------------------
[~paul-rogers],
Thank you for the great introduction to CSV files!
So let's combine all of the above to decide how to modify CSV reading. Here
are all the cases that I think I need to handle for this ticket:
*1)* What do you think: do I need to create a session/system option that
controls whether comments are treated as data or skipped (the first point from
my previous message), or would allowing a "blank" comment symbol be enough?
*2) File has headers.* In this case I will create a session/system option that
skips the header when comments are treated as data only (point *1)*). But as I
understand it, a header does not always start with the comment symbol, so it
could be hard to separate such a header from the data. I think this option
should be of byte type (not boolean), so that the customer can specify manually
how many lines to skip from the beginning of the document (this assumes the
customer uses the same header template in all CSV files). If the value is "0"
(the default), we will skip all comment lines at the top.
*3) Read headers.* This seems to duplicate point *2)*. Could you please explain
what I need to add? Maybe you have a different view on points *2)* and *3)*?
*4) Skip blank lines.* Nice find! In this case I will create a separate
session/system option that controls whether blank lines are skipped.
*5) Unix extensions.* In this case I will create a separate session/system
option that removes the "\ #" symbols from the beginning of the line and adds
the rest as data.
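To make points *1)*, *2)*, *4)* and *5)* easier to discuss, here is a rough
sketch of how the options could interact. It is written in Python only to keep
it short; it is not Drill code, and every option name in it
(comment, skip_comments, header_lines, skip_blank_lines,
strip_escaped_comment) is hypothetical. It also assumes that the prefix in
point *5)* is a backslash immediately followed by the comment symbol, and that
both characters are removed, which follows the wording of that point literally.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class TextReaderOptions:
    # Hypothetical option names for illustration; none of them exist in Drill.
    comment: str = "#"                   # point 1: "" (blank) turns comment handling off
    skip_comments: bool = True           # point 1: False keeps comment-like body lines as data
    header_lines: int = 0                # point 2: N > 0 skips N lines; 0 skips leading comments
    skip_blank_lines: bool = False       # point 4
    strip_escaped_comment: bool = False  # point 5
    delimiter: str = "|"


def read_rows(lines: Iterable[str], opts: TextReaderOptions) -> Iterator[List[str]]:
    """Yield the fields of every line that the options classify as data."""
    in_leading_comments = opts.header_lines == 0 and opts.comment != ""
    for index, raw in enumerate(lines):
        line = raw.rstrip("\r\n")

        # Point 2: a fixed-size header is skipped unconditionally.
        if index < opts.header_lines:
            continue

        # Point 2 with the default of 0: skip the run of comment lines at the top.
        if in_leading_comments:
            if line.startswith(opts.comment):
                continue
            in_leading_comments = False

        # Point 4: optionally drop blank lines.
        if opts.skip_blank_lines and line.strip() == "":
            continue

        if opts.comment:
            escaped = "\\" + opts.comment
            if opts.strip_escaped_comment and line.startswith(escaped):
                # Point 5: remove the escaped prefix and keep the rest as data.
                line = line[len(escaped):]
            elif opts.skip_comments and line.startswith(opts.comment):
                # Point 1: comment lines in the body are skipped only on request.
                continue

        yield line.split(opts.delimiter)


# The data set from the ticket description.
DATA = ["D|32", "8h|234", ";#|3489", "^$*(|308", "#|98"]

# With skip_comments=False the last row is kept as ["#", "98"].
rows = list(read_rows(DATA, TextReaderOptions(skip_comments=False)))
```

With the default options this sketch returns 4 rows for the data set in the
ticket, which matches the reported wrong result; with skip_comments=False or a
blank comment symbol it returns all 5. One caveat of the "0" default in point
*2)*: if the first data row itself starts with the comment symbol, it is
skipped as part of the leading comments.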
Could you please tell me your thoughts? Maybe you have some notes that I should
add to this Jira?
> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
> Key: DRILL-5239
> URL: https://issues.apache.org/jira/browse/DRILL-5239
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.10.0
> Reporter: Rahul Challapalli
> Assignee: Roman
> Priority: Blocker
> Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1 | col2 |
> +-------+-------+
> | D | 32 |
> | 8h | 234 |
> | ;# | 3489 |
> | ^$*( | 308 |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a parquet file.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)