[
https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486989#comment-16486989
]
Fabian Hueske edited comment on FLINK-6016 at 5/23/18 9:42 AM:
---------------------------------------------------------------
This problem cannot be solved with the current implementation of the
{{CsvInputFormat}} which is based on {{DelimitedInputFormat}}.
In the current implementation, the file is first split into rows (without
looking at quote characters) and then each row is parsed. This behavior is
pretty much baked in and cannot be easily changed.
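As a rough illustration of why the two-phase approach fails (this is a standalone sketch, not the actual {{DelimitedInputFormat}} code; the class and method names are hypothetical), compare a row split that ignores quote characters with one that tracks them:

```java
import java.util.ArrayList;
import java.util.List;

class RowSplitSketch {

    // Phase one as described above: cut at every row delimiter, ignoring quotes.
    static List<String> splitIgnoringQuotes(String data) {
        List<String> rows = new ArrayList<>();
        for (String row : data.split("\n")) {
            rows.add(row);
        }
        return rows;
    }

    // Quote-aware alternative: a newline inside a quoted field does not end the row.
    // Assumes RFC 4180 style quoting, where an escaped quote is written as "".
    static List<String> splitRespectingQuotes(String data) {
        List<String> rows = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice and leaves the state unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                rows.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            rows.add(current.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        String data = "\"3\n4\",5\n";
        // Two fragments; the first is '"3', an unterminated quoted string.
        System.out.println(splitIgnoringQuotes(data).size());
        // One row with the newline kept inside the quoted field.
        System.out.println(splitRespectingQuotes(data).size());
    }
}
```

With the input from the issue description, the first method hands the field parser the fragment '"3', which is where the UNTERMINATED_QUOTED_STRING error comes from.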
There is a [PR that uses a CSV parsing
library|https://github.com/apache/flink/pull/4660] to scan CSV files and
handles this case better.
However, in general, row delimiters in quoted strings can only be processed
properly if we read each CSV file as a whole, i.e., without splitting it into
smaller chunks that are read in parallel by different tasks.
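The limitation with parallel chunks can be sketched as follows (a hypothetical standalone example, not Flink code, assuming RFC 4180 style quoting): a task that starts reading at an arbitrary offset cannot tell whether it is inside a quoted field unless it knows everything that came before that offset.

```java
class SplitBoundarySketch {

    // Returns true if a reader starting at `offset` would begin inside a quoted field.
    // The answer depends on the parity of all quote characters before the offset,
    // so it cannot be derived from the chunk alone.
    static boolean startsInsideQuotes(String data, int offset) {
        boolean inQuotes = false;
        for (int i = 0; i < offset; i++) {
            if (data.charAt(i) == '"') {
                inQuotes = !inQuotes;
            }
        }
        return inQuotes;
    }

    public static void main(String[] args) {
        String data = "a,\"3\n4\",5\nb,6,7\n";
        int afterFirstNewline = data.indexOf('\n') + 1;
        int afterSecondNewline = data.indexOf('\n', afterFirstNewline) + 1;
        // The first newline is inside the quoted field, so it is not a row boundary.
        System.out.println(startsInsideQuotes(data, afterFirstNewline));
        // The second newline ends the row; a chunk may safely start after it.
        System.out.println(startsInsideQuotes(data, afterSecondNewline));
    }
}
```

Both offsets follow a newline character, yet only the second one is a real row boundary, and telling them apart requires scanning the file from the beginning.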
> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
> Key: FLINK-6016
> URL: https://issues.apache.org/jira/browse/FLINK-6016
> Project: Flink
> Issue Type: Bug
> Components: Batch Connectors and Input/Output Formats
> Affects Versions: 1.2.0
> Reporter: Luke Hutchison
> Priority: Major
>
> The RFC for the CSV format specifies that newlines are valid in quoted
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when Flink parses a CSV file containing a newline inside a quoted string, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING
> Expect field types: class java.lang.String, class java.lang.String
> {noformat}