[ 
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075143#comment-16075143
 ] 

Roman commented on DRILL-5239:
------------------------------

[~paul-rogers],
Thank you for the great introduction to CSV files!

So let's combine all of the above to modify CSV reading. I want to list all the 
cases that I need to handle for this ticket:

*1)* What do you think: do I need to create a session/system option that 
controls whether comments are treated as data or skipped (the 1st point from my 
previous message), or would setting a "blank" comment symbol be enough?
*2) File has headers.* In this case I will create a session/system option that 
skips the header when we treat comments as data only (point *1)*). But as I 
understand it, a header does not always start with the comment symbol, so it 
could be hard to separate a header (which does not start with the comment 
symbol) from the data. I think we should make this option a byte type (not 
boolean), so that the customer can specify manually how many lines to skip from 
the beginning of the document (this assumes the customer uses the same header 
template in all CSV files). If the value is "0" (the default), we will skip all 
comment lines at the top.
*3) Read headers.* It seems we are duplicating point *2)*. Could you please 
explain what I need to add? Maybe you have a different view of points *2)* and 
*3)*? 
*4) Skip blank lines.* Nice find! In this case I will create a separate 
session/system option that controls whether blank lines are skipped.
*5) Unix extensions.* In this case I will create a separate session/system 
option that removes the "\ #" symbols from the beginning of the line and adds 
the rest as data.
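
Points *2)* and *4)* could be sketched roughly as follows. This is plain Java, 
not Drill's actual text reader code: the class name `TextLineFilter`, the 
method `filter`, and all parameter names are hypothetical and exist only for 
illustration. It also assumes that a comment line is one whose first character 
is the comment symbol.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these names exist in Drill. */
final class TextLineFilter {
    private TextLineFilter() {}

    /** Assumption: a comment line is a line whose first character is the comment symbol. */
    private static boolean isComment(String line, char comment) {
        return !line.isEmpty() && line.charAt(0) == comment;
    }

    /**
     * @param lines          raw lines of the file, in order
     * @param skipFirstLines point 2: number of leading lines to drop;
     *                       0 means "drop every leading comment line instead"
     * @param skipBlankLines point 4: drop empty or whitespace-only lines
     * @param comment        the comment symbol, e.g. '#'
     * @return the lines that would be handed to the reader as data
     */
    static List<String> filter(List<String> lines, int skipFirstLines,
                               boolean skipBlankLines, char comment) {
        int start = 0;
        if (skipFirstLines > 0) {
            // Fixed-size header: skip exactly N lines (or the whole file if shorter).
            start = Math.min(skipFirstLines, lines.size());
        } else {
            // Default: skip only the comment lines at the very top of the file.
            while (start < lines.size() && isComment(lines.get(start), comment)) {
                start++;
            }
        }

        List<String> data = new ArrayList<>();
        for (int i = start; i < lines.size(); i++) {
            String line = lines.get(i);
            if (skipBlankLines && line.trim().isEmpty()) {
                continue;
            }
            // After the header, a line starting with the comment symbol is kept as data.
            data.add(line);
        }
        return data;
    }
}
```

With this approach a row such as `#|98` from the data set in this ticket is 
kept whenever it appears after the first data row, because only the leading 
lines are ever skipped.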

Could you please tell me your thoughts? Maybe you have some notes that I should 
add to this Jira?

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from 
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not however happen with a parquet file



