[
https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486989#comment-16486989
]
Fabian Hueske commented on FLINK-6016:
--------------------------------------
This problem cannot be solved with the current implementation of the
{{CsvInputFormat}}, which is based on {{DelimitedInputFormat}}.
In the current implementation, the file is first split into rows (without
looking at quote characters) and then each row is parsed. This behavior is
pretty much baked in and cannot be easily changed.
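To make the failure mode concrete, here is a minimal sketch of "split into
rows first, parse later". It is not Flink's actual code: the class
{{NaiveRowSplit}} and its methods are hypothetical names used only for
illustration, and the input is the example from the issue description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, NOT the real DelimitedInputFormat logic.
public class NaiveRowSplit {

    // Splits the input at every '\n' without looking at quote characters.
    static List<String> splitRows(String data) {
        List<String> rows = new ArrayList<>();
        for (String row : data.split("\n")) {
            rows.add(row);
        }
        return rows;
    }

    // A row with an odd number of quote characters has an unbalanced quote.
    // (RFC 4180 escapes a quote by doubling it, so balanced rows are even.)
    static boolean hasUnterminatedQuote(String row) {
        int quotes = 0;
        for (char c : row.toCharArray()) {
            if (c == '"') {
                quotes++;
            }
        }
        return quotes % 2 != 0;
    }

    public static void main(String[] args) {
        // The record from the issue: a quoted field containing a newline.
        String data = "\"3\n4\",5\n";
        for (String row : splitRows(data)) {
            System.out.println("[" + row + "] unterminated quote: "
                    + hasUnterminatedQuote(row));
        }
    }
}
```

Because the row boundary is decided before any quote is inspected, the single
record is cut into two fragments, and the field parser only ever sees the
first fragment with its opening quote and no closing one.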
There is a [PR that uses a CSV parsing
library|https://github.com/apache/flink/pull/4660] to scan CSV files and
handles this case better.
However, in general, row delimiters in quoted strings can only be processed
properly if we read CSV files as a whole, i.e., without splitting them into
smaller chunks that are read in parallel by different tasks.
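As a rough sketch of why a sequential scan is needed, the following
quote-aware record splitter tracks whether it is currently inside a quoted
string. The class {{QuoteAwareRecords}} is a hypothetical name, not part of
Flink or of the linked PR, and the sketch assumes '\n' as the row delimiter
and '"' as the quote character with RFC 4180 style doubled-quote escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a sequential, quote-aware record scan.
public class QuoteAwareRecords {

    // Splits data into records at '\n', ignoring newlines inside quotes.
    static List<String> splitRecords(String data) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // A newline outside quotes ends the current record.
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // last record without trailing newline
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "\"3\n4\",5\n6,7\n";
        for (String record : splitRecords(data)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

The value of {{inQuotes}} at any position depends on every character before
it. A task that starts reading at an arbitrary offset in the middle of the
file has no way to know that state, which is why this approach only works
when the file is scanned from the beginning.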
> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
> Key: FLINK-6016
> URL: https://issues.apache.org/jira/browse/FLINK-6016
> Project: Flink
> Issue Type: Bug
> Components: Batch Connectors and Input/Output Formats
> Affects Versions: 1.2.0
> Reporter: Luke Hutchison
> Priority: Major
>
> The RFC for the CSV format specifies that newlines are valid in quoted
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING
> Expect field types: class java.lang.String, class java.lang.String
> {noformat}