[https://issues.apache.org/jira/browse/FLINK-10684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813075#comment-16813075]
Fabian Hueske commented on FLINK-10684:
---------------------------------------
Thanks for the summary [~xccui]!
I agree, P1 and P2 should ideally be solved with a generic approach that also
works for formats other than CSV.
P3 could be addressed by FLINK-7050 already, IMO.
There is an open PR that (IIRC) would need a bit of refactoring and
improvement. The challenge here is to support parallel reads by dividing a
file into multiple splits (a line break might not indicate a new record but
could be enclosed in a quoted string field).
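To make the split problem concrete, here is a minimal quote-aware scan (a sketch, not the approach of the open PR). It only works if the scan starts at a true record boundary, which is exactly what an arbitrary split offset does not guarantee.
{code:java}
// Hypothetical sketch: why a split cannot simply start at the next line break.
// Assumes the scan begins at a true record boundary and that quoting follows
// RFC 4180 (fields containing line breaks are enclosed in double quotes,
// literal quotes are escaped by doubling).
public final class QuoteAwareScanner {

    /** Returns the offset right after the next record delimiter, or -1. */
    public static int nextRecordBoundary(byte[] buffer, int start) {
        boolean inQuotes = false;
        for (int i = start; i < buffer.length; i++) {
            byte b = buffer[i];
            if (b == '"') {
                // A doubled quote inside a quoted field is an escaped quote.
                if (inQuotes && i + 1 < buffer.length && buffer[i + 1] == '"') {
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (b == '\n' && !inQuotes) {
                return i + 1; // line break outside quotes -> record boundary
            }
        }
        return -1; // no complete record in this buffer
    }
}
{code}
A reader dropped at an arbitrary byte offset cannot tell whether it is inside a quoted field without context from a known boundary, which is why naive newline-based split assignment breaks for standard CSV.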
This would also be a nice complement to the CSV DeserializationSchema, which
reads standards-compliant CSV from Kafka.
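For reference, reading CSV rows from Kafka with the flink-csv format looks roughly like the following sketch. The topic name, field names, and connection properties are invented for illustration, and the builder API may differ slightly between Flink versions.
{code:java}
// Sketch only: consuming standards-compliant CSV records from Kafka with the
// flink-csv format and the universal Kafka connector.
import java.util.Properties;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.formats.csv.CsvRowDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.types.Row;

public class CsvFromKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Expected row type; field names and types are made up for the example.
        TypeInformation<Row> rowType =
            Types.ROW_NAMED(new String[] {"id", "name", "score"},
                Types.INT, Types.STRING, Types.DOUBLE);

        CsvRowDeserializationSchema schema =
            new CsvRowDeserializationSchema.Builder(rowType)
                .setFieldDelimiter(',')
                .build();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "csv-demo");

        env.addSource(new FlinkKafkaConsumer<>("csv-topic", schema, props))
           .print();

        env.execute("CSV from Kafka");
    }
}
{code}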
> Improve the CSV reading process
> -------------------------------
>
> Key: FLINK-10684
> URL: https://issues.apache.org/jira/browse/FLINK-10684
> Project: Flink
> Issue Type: Improvement
> Components: API / DataSet
> Reporter: Xingcan Cui
> Priority: Major
>
> CSV is one of the most commonly used file formats in data wrangling. To load
> records from CSV files, Flink provides the basic {{CsvInputFormat}}, as well
> as some variants (e.g., {{RowCsvInputFormat}} and {{PojoCsvInputFormat}}).
> However, the reading process can still be improved. For example, we could add
> a built-in util that automatically infers schemas from CSV headers and
> samples of the data. Also, the current bad-record handling can be improved by
> keeping the invalid lines (and even the reasons why parsing failed) instead
> of only logging their total number (see the sketch below).
> This is an umbrella issue for all the improvements and bug fixes for the CSV
> reading process.
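A minimal sketch of the bad-record idea from the description above: keep each invalid line together with the reason it failed instead of only counting failures. All names here ({{LenientCsvParser}}, {{BadRecord}}) are hypothetical, not an existing Flink API, and the naive comma split deliberately ignores quoting.
{code:java}
// Hypothetical sketch: parse each line against an expected (id, name, score)
// row and record invalid lines along with the reason parsing failed.
import java.util.ArrayList;
import java.util.List;

public class LenientCsvParser {

    public static final class BadRecord {
        public final long lineNo;
        public final String line;
        public final String reason;
        BadRecord(long lineNo, String line, String reason) {
            this.lineNo = lineNo; this.line = line; this.reason = reason;
        }
    }

    private final List<BadRecord> badRecords = new ArrayList<>();

    /** Parses a line; returns null and records the reason on failure. */
    public Object[] parseLine(long lineNo, String line) {
        String[] fields = line.split(",", -1); // naive split; no quote handling
        if (fields.length != 3) {
            badRecords.add(new BadRecord(lineNo, line,
                "expected 3 fields, got " + fields.length));
            return null;
        }
        try {
            return new Object[] {Integer.parseInt(fields[0].trim()), fields[1],
                                 Double.parseDouble(fields[2].trim())};
        } catch (NumberFormatException e) {
            badRecords.add(new BadRecord(lineNo, line,
                "numeric field: " + e.getMessage()));
            return null;
        }
    }

    public List<BadRecord> getBadRecords() {
        return badRecords;
    }
}
{code}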