[ 
https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486989#comment-16486989
 ] 

Fabian Hueske edited comment on FLINK-6016 at 5/23/18 9:42 AM:
---------------------------------------------------------------

This problem cannot be solved with the current implementation of 
{{CsvInputFormat}}, which is based on {{DelimitedInputFormat}}.
The current implementation first splits the file into rows (without 
looking at quote characters) and then parses each row. This behavior is 
baked into the design and cannot be easily changed.
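
To illustrate the ordering problem, here is a minimal Python sketch (not Flink code; the function names {{split_rows_naive}} and {{split_rows_quote_aware}} are made up for this example). It contrasts splitting on the row delimiter first with a single scan that tracks whether we are inside a quoted string, using the input from the issue description plus one extra row:

```python
def split_rows_naive(text):
    """Split on the row delimiter first, without looking at quote characters."""
    return text.split("\n")


def split_rows_quote_aware(text, quote='"', row_delim="\n"):
    """Scan once; a row delimiter inside a quoted string stays part of the field."""
    rows, current, in_quotes = [], [], False
    for ch in text:
        if ch == quote:
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == row_delim and not in_quotes:
            rows.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        rows.append("".join(current))
    return rows


data = '"3\n4",5\n6,7'
naive_rows = split_rows_naive(data)        # first row is '"3' (unterminated quote)
aware_rows = split_rows_quote_aware(data)  # first row keeps the embedded newline
```

With the naive split, the first "row" handed to the field parser is {{"3}}, which is where an unterminated-quote error would come from.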

There is a [PR that uses a CSV parsing 
library|https://github.com/apache/flink/pull/4660] to scan CSV files and 
handles this case better.
However, in general, row delimiters in quoted strings can only be processed 
properly if we read each CSV file as a whole, i.e., without splitting it into 
smaller chunks that are read in parallel by different tasks. 
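
A small Python sketch of why splitting hurts even a quote-aware scanner (again not Flink code; {{scan_rows}} and the sample data are assumptions for illustration). A reader that starts in the middle of a file does not know whether it starts inside a quoted string, so it may get the quote state inverted:

```python
import csv
import io


def scan_rows(text, in_quotes=False):
    """Quote-aware row scan; `in_quotes` is the state assumed at the start."""
    rows, start = [], 0
    for i, ch in enumerate(text):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            rows.append(text[start:i])
            start = i + 1
    if start < len(text):
        rows.append(text[start:])
    return rows, in_quotes


data = 'a,"x\ny"\nb,"z"\n'

# Reading the file as a whole: the scanner sees every quote from the start.
whole_rows, _ = scan_rows(data)

# Hypothetical split boundary right after the newline inside the quoted field.
# The second reader starts there and wrongly assumes it is outside quotes.
boundary = data.index("\n") + 1
chunk_rows, ended_in_quotes = scan_rows(data[boundary:])

# For comparison, Python's csv module reading the whole input.
parsed = list(csv.reader(io.StringIO(data, newline="")))
```

Here {{whole_rows}} contains the two expected rows, while the mid-file reader produces a single bogus row and ends inside an "open" quote. Deciding the correct state at the boundary would require knowing all quotes that precede it.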



> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
>            Priority: Major
>
> The RFC for the CSV format specifies that newlines are valid in quoted 
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING 
> Expect field types: class java.lang.String, class java.lang.String 
> {noformat}



