[
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073940#comment-16073940
]
Roman commented on DRILL-5239:
------------------------------
[~paul-rogers],
I want to add 2 system/session options:
- The 1st option should either treat all comments as data or skip them. I think
we should skip comments by default.
- The 2nd option should skip the header (not show the header as data) when the
first option is set to treat comments as data.
Let me illustrate my thoughts with an example:
{code:title=Original file}
#Header
#Of text file
data1|data2
data3|data4
#data5|data6
{code}
So in this case we get 3 scenarios:
1) The 1st option skips all comments -> we will not see any lines that start
with the comment symbol;
{code:title=Result of 1) scenario}
+-------+-------+
| col1  | col2  |
+-------+-------+
| data1 | data2 |
| data3 | data4 |
+-------+-------+
{code}
2) The 1st option treats comments as data AND the 2nd option skips the header ->
we will see all lines below the header.
{code:title=Result of 2) scenario}
+--------+-------+
| col1   | col2  |
+--------+-------+
| data1  | data2 |
| data3  | data4 |
| #data5 | data6 |
+--------+-------+
{code}
3) The 1st option treats comments as data AND the 2nd option does not skip the
header -> we will see all lines.
{code:title=Result of 3) scenario}
+---------------+-------+
| col1          | col2  |
+---------------+-------+
| #Header       | null  |
| #Of text file | null  |
| data1         | data2 |
| data3         | data4 |
| #data5        | data6 |
+---------------+-------+
{code}
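To make the three scenarios concrete, here is a minimal Python sketch of the proposed behavior. It is not Drill code. The names `read_rows`, `comments_as_data` and `skip_header` are hypothetical and are not actual Drill option names. The sketch also assumes that the "header" is the run of comment lines at the very top of the file and that `#` is the comment symbol.

```python
# Illustrative sketch only; not the Drill text reader implementation.
COMMENT = "#"  # assumed comment symbol
DELIM = "|"    # field delimiter used in the example file

ORIGINAL_FILE = [
    "#Header",
    "#Of text file",
    "data1|data2",
    "data3|data4",
    "#data5|data6",
]


def read_rows(lines, comments_as_data=False, skip_header=False, width=2):
    """Return rows as lists of fields, padded with None up to `width`."""
    rows = []
    in_header = True  # True until the first non-comment line is seen
    for line in lines:
        is_comment = line.startswith(COMMENT)
        if not is_comment:
            in_header = False
        if is_comment:
            if not comments_as_data:
                continue  # scenario 1: drop every comment line
            if skip_header and in_header:
                continue  # scenario 2: drop only the leading comment block
        fields = line.split(DELIM)
        fields += [None] * (width - len(fields))  # missing columns become null
        rows.append(fields)
    return rows


scenario_1 = read_rows(ORIGINAL_FILE)
scenario_2 = read_rows(ORIGINAL_FILE, comments_as_data=True, skip_header=True)
scenario_3 = read_rows(ORIGINAL_FILE, comments_as_data=True)
```

With these assumptions, `scenario_1` contains only the two plain data rows, `scenario_2` additionally contains the `#data5` row, and `scenario_3` contains all five lines, with `None` standing in for the null second column of the header lines.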
Could you please tell me what you think of this approach? Do you have any notes?
> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
> Key: DRILL-5239
> URL: https://issues.apache.org/jira/browse/DRILL-5239
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.10.0
> Reporter: Rahul Challapalli
> Assignee: Roman
> Priority: Blocker
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a parquet file.