[ https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487151#comment-16487151 ]
Luke Hutchison commented on FLINK-6016:
---------------------------------------
[~fhueske] reading a file in parallel is not faster on most filesystems,
storage devices, and operating systems. In fact, on a device with high seek
latency, such as an HDD, reading from several threads in parallel will
actually increase the total read time, potentially dramatically, because the
drive has to seek back and forth between the threads' read positions. Reading
a file from multiple threads can only be truly fast if the entire file is
already cached in RAM.
I suggest simply reading the file serially, and emitting lines to a collection
that can then be read in parallel by multiple mappers.
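A minimal sketch of that idea, not Flink code: the class and method names
below (SerialCsvReader, readRecords, countFields, fieldCounts) are
hypothetical, and the quote handling is a simplification of RFC 4180 that
only tracks whether the scanner is inside double quotes. One thread splits
the input into records, keeping quoted newlines inside the record, and the
resulting list is then processed in parallel.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Flink.
class SerialCsvReader {

    // Serial step: a single thread scans the input once and splits it into
    // records. A newline ends a record only when it is outside double quotes.
    static List<String> readRecords(Reader in) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        try (BufferedReader reader = new BufferedReader(in)) {
            int c;
            while ((c = reader.read()) != -1) {
                char ch = (char) c;
                if (ch == '"') {
                    // An escaped quote ("") toggles twice, leaving the state unchanged.
                    inQuotes = !inQuotes;
                }
                if (ch == '\n' && !inQuotes) {
                    int len = current.length();
                    if (len > 0 && current.charAt(len - 1) == '\r') {
                        current.setLength(len - 1); // drop the CR of a CRLF line ending
                    }
                    records.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            records.add(current.toString()); // last record without a trailing newline
        }
        return records;
    }

    // Example per-record work: count fields, ignoring commas inside quotes.
    static int countFields(String record) {
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char ch = record.charAt(i);
            if (ch == '"') {
                inQuotes = !inQuotes;
            } else if (ch == ',' && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }

    // Parallel step: the already-split records are mapped by multiple threads.
    static List<Integer> fieldCounts(List<String> records) {
        return records.parallelStream()
                .map(SerialCsvReader::countFields)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The sample from the issue description, followed by an ordinary row.
        List<String> records = readRecords(new StringReader("\"3\n4\",5\n6,7\n"));
        System.out.println(records.size() + " records, field counts " + fieldCounts(records));
    }
}
```

Because the record boundaries are decided in a single sequential pass, no
mapper ever has to guess whether a newline falls inside a quoted string; the
storage device also sees one sequential read instead of competing seeks.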
> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
> Key: FLINK-6016
> URL: https://issues.apache.org/jira/browse/FLINK-6016
> Project: Flink
> Issue Type: Bug
> Components: Batch Connectors and Input/Output Formats
> Affects Versions: 1.2.0
> Reporter: Luke Hutchison
> Priority: Major
>
> The RFC for the CSV format specifies that newlines are valid in quoted
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING
> Expect field types: class java.lang.String, class java.lang.String
> {noformat}