[ 
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073940#comment-16073940
 ] 

Roman commented on DRILL-5239:
------------------------------

[~paul-rogers],

I want to add 2 system/session options:
- The 1st option should either treat all comments as data or skip them. I think we 
should skip comments by default.
- The 2nd option should skip the header (not show the header as data) when the 
first option is set to treat comments as data.

Let me illustrate my thoughts with an example:

{code:title=Original file}
#Header
#Of text file
data1|data2
data3|data4
#data5|data6
{code}

So in this case we have 3 scenarios:
1) The 1st option skips all comments -> we will not see any lines that start with 
the comment symbol;
{code:title=Result of scenario 1)}
+-------+-------+
| col1  | col2  |
+-------+-------+
| data1 | data2 |
| data3 | data4 |
+-------+-------+
{code}
2) The 1st option treats comments as data AND the 2nd option skips the header -> 
we will see all lines below the header.
{code:title=Result of scenario 2)}
+--------+-------+
| col1   | col2  |
+--------+-------+
| data1  | data2 |
| data3  | data4 |
| #data5 | data6 |
+--------+-------+
{code}
3) The 1st option treats comments as data AND the 2nd option does not skip the 
header -> we will see all lines.
{code:title=Result of scenario 3)}
+---------------+-------+
| col1          | col2  |
+---------------+-------+
| #Header       | null  |
| #Of text file | null  |
| data1         | data2 |
| data3         | data4 |
| #data5        | data6 |
+---------------+-------+
{code}
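
To make the three scenarios concrete, here is a minimal Python sketch of the behavior I have in mind. It is only a model, not Drill code: the function `read_text` and the parameters `skip_comments` and `skip_header` are hypothetical names, not actual option names, and the sketch assumes that "header" means the run of comment lines at the very top of the file.

```python
def read_text(lines, skip_comments=True, skip_header=False,
              comment="#", delimiter="|"):
    """Hypothetical model of the two proposed options (not Drill code)."""
    rows = []
    in_header = True  # True while we are still in the leading comment lines
    for line in lines:
        is_comment = line.startswith(comment)
        if not is_comment:
            in_header = False  # first data line ends the header
        # Option 1 drops every comment line; option 2 drops only header lines.
        if is_comment and (skip_comments or (skip_header and in_header)):
            continue
        fields = line.split(delimiter)
        # A missing second column becomes None (shown as null in the tables).
        rows.append((fields[0], fields[1] if len(fields) > 1 else None))
    return rows


source_lines = ["#Header", "#Of text file",
                "data1|data2", "data3|data4", "#data5|data6"]

scenario_1 = read_text(source_lines, skip_comments=True)
scenario_2 = read_text(source_lines, skip_comments=False, skip_header=True)
scenario_3 = read_text(source_lines, skip_comments=False, skip_header=False)
```

In this model the 2nd option has an effect only when the 1st option treats comments as data, which matches the scenarios above.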

Could you please tell me what you think of this proposal? Do you have any notes?

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from 
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a Parquet file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
