[
https://issues.apache.org/jira/browse/DRILL-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877314#comment-14877314
]
Aman Sinha commented on DRILL-3808:
-----------------------------------
Thanks Jacques. Yes, setting the quote to '\0' allows the parsing to succeed
for the TSV file. We would just need to ensure this does not affect other
formats such as CSV. I will work with [~seanhychu] on getting the localized
change tested and put out for review.
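
To illustrate the effect of disabling the quote character, here is a minimal sketch that uses Python's standard `csv` module as a stand-in. It is not Drill's text reader, and `parse_line` is a hypothetical helper introduced only for this example. It assumes that turning off quote handling is roughly equivalent to setting the quote to '\0', and it checks that quote-aware CSV parsing is unaffected.

```python
import csv
import io


def parse_line(line, delimiter, quote_aware):
    """Split one record into fields.

    quote_aware=True  -> '"' may enclose a field (CSV-style behaviour).
    quote_aware=False -> '"' is ordinary data (TSV-style behaviour).
    """
    quoting = csv.QUOTE_MINIMAL if quote_aware else csv.QUOTE_NONE
    reader = csv.reader(io.StringIO(line), delimiter=delimiter, quoting=quoting)
    return next(reader)


tsv_line = '"test"\t"test"'   # the sample record from the issue description
csv_line = '"a,b",c'          # a CSV record that relies on quoting

# TSV parsed with CSV-style quoting: the quotes are stripped.
print(parse_line(tsv_line, "\t", quote_aware=True))

# TSV parsed with quoting disabled: the quotes survive as literals.
print(parse_line(tsv_line, "\t", quote_aware=False))

# CSV keeps its quote-aware behaviour, so the embedded comma is preserved.
print(parse_line(csv_line, ",", quote_aware=True))
```

Keeping the quote setting per format, rather than global, is what keeps the change localized: only the TSV path stops interpreting the double quote.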
> TextParsingException when reading TSV file with fields that have quoted
> sub-strings
> -----------------------------------------------------------------------------------
>
> Key: DRILL-3808
> URL: https://issues.apache.org/jira/browse/DRILL-3808
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Reporter: Sean Hsuan-Yi Chu
> Assignee: Sean Hsuan-Yi Chu
> Priority: Critical
>
> According to references [1], [2]:
> In .csv, the double quote is a special character as it can optionally enclose
> a text field. But in .tsv, it is not a special character: it can appear
> anywhere, and when it does, it should be treated as a literal. The tsv format
> specification also does not provide for the tab or CR/LF characters to show
> up anywhere in text fields. However, Drill treats tsv the same as csv.
> For example, given the data:
> {code}
> "test"\t"test"
> {code}
> For the query: select columns[0], columns[1] from `t.tsv`; Drill returns
> {code}
> test test
> {code}
> However, according to reference [2], the result should be
> {code}
> "test" "test"
> {code}
> Ideally, Drill should follow the standard (see [2]).
> [1] CSV - https://tools.ietf.org/html/rfc4180
> [2] TSV -
> http://www.iana.org/assignments/media-types/text/tab-separated-values