[ 
https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072784#comment-16072784
 ] 

Paul Rogers commented on DRILL-5239:
------------------------------------

Note that the '#' symbol is used in some formats to indicate a comment:

{code}
# Exported from server abcd at 2017:07:01T01:00:00
# Server log version 2.3
time,recv-ip,bytes,status,...
<data>
{code}

Since some formats allow this, a solution might be to add an option (normally
off) that permits comments in headers. If the option is off, then '#' is just
another character. If it is on, then we skip comment lines until we find a
header. In neither case do we need to allow comment lines in the data section.
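
To make the proposed behavior concrete, here is a rough sketch in Python. It is not Drill's text reader code; the function name `read_delimited` and the parameters `allow_header_comments` and `comment_char` are made up for illustration. It only shows the idea: comment lines are skipped ahead of the header when the option is on, and '#' is never special in the data section.

```python
from typing import Iterable, List, Optional, Tuple


def read_delimited(
    lines: Iterable[str],
    delimiter: str = ",",
    allow_header_comments: bool = False,  # hypothetical option, off by default
    comment_char: str = "#",
) -> Tuple[Optional[List[str]], List[List[str]]]:
    """Return (header, rows) for a delimited text input.

    When allow_header_comments is True, lines starting with comment_char
    are skipped until the header line is found. Data lines are never
    treated as comments, whatever the option is set to.
    """
    it = iter(lines)
    header: Optional[List[str]] = None

    # Header phase: optionally skip leading comment lines.
    for raw in it:
        line = raw.rstrip("\r\n")
        if allow_header_comments and line.startswith(comment_char):
            continue
        header = line.split(delimiter)
        break

    # Data phase: '#' is just another character here.
    rows = [raw.rstrip("\r\n").split(delimiter) for raw in it]
    return header, rows
```

With the option on, the two leading '#' lines in the example above would be skipped and `time,recv-ip,bytes,status,...` would become the header, while a data row whose first field is '#' would still be returned as data. With the option off, the first line is taken as the header even if it starts with '#'.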

There is probably a write-up of this somewhere for a format that allows
comments. Perhaps we can track that down as a reference. (I saw the format in
conjunction with web logs a few jobs back...)

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from 
> dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, happen with a Parquet file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
