[
https://issues.apache.org/jira/browse/FLINK-10684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812514#comment-16812514
]
Xingcan Cui commented on FLINK-10684:
-------------------------------------
Hi all, thanks for your attention. I ran into several problems while reading
CSV files for data wrangling, which is why I filed this ticket.
As shown in the description, the problems come from different aspects:
P1. Lack of schema inference.
P2. Weak error handling.
P3. Other implicit bugs (or the standards incompatibilities mentioned by
[~fhueske]).
Since the basic {{CsvInputFormat}} is used in both the streaming and batch
environments, some solutions to these problems may be tricky. For instance, to
automatically infer schemas, we would need to introduce a new mechanism, stream
sampling, which I believe would also benefit other processes such as automatic
parallelism tuning and streaming SQL optimization.
To solve these problems in my own project, I applied some workarounds that are
not general enough. Although I do have some general ideas (e.g., using side
outputs for bad records), considering that the Flink project has been adopting
some major changes recently, it may be better to propose them after those
changes settle.
All in all, I personally don't think this is a good time to concentrate on this
issue, since none of the solutions are trivial. What do you think?
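The "side output for bad records" idea can be sketched without any Flink
dependency: instead of only counting parse failures, route every raw line either
to a parsed-records collection or to a bad-records collection that carries the
failure reason. This is a minimal JDK-only illustration of the routing logic;
the class and method names are made up for the example and are not Flink API
(in Flink itself this would live in a {{ProcessFunction}} emitting to an
{{OutputTag}}).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Flink API: split CSV lines into good records
// and bad records, keeping the invalid line plus the reason it failed.
public class CsvBadRecordDemo {

    // A rejected input line together with the reason parsing failed.
    public static final class BadRecord {
        public final String line;
        public final String reason;
        BadRecord(String line, String reason) {
            this.line = line;
            this.reason = reason;
        }
    }

    // Parse one line against a hypothetical (long, long) schema.
    // Valid rows are appended to `good`; failures go to `bad` with a reason,
    // mirroring what a side output would carry.
    public static void route(String line, List<long[]> good, List<BadRecord> bad) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            bad.add(new BadRecord(line, "expected 2 fields, got " + fields.length));
            return;
        }
        try {
            good.add(new long[] {
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim())
            });
        } catch (NumberFormatException e) {
            bad.add(new BadRecord(line, "not a number: " + e.getMessage()));
        }
    }

    public static void main(String[] args) {
        List<long[]> good = new ArrayList<>();
        List<BadRecord> bad = new ArrayList<>();
        for (String line : List.of("1,2", "3,x", "4")) {
            route(line, good, bad);
        }
        // "1,2" parses; "3,x" fails on the number; "4" has too few fields.
        System.out.println(good.size() + " good, " + bad.size() + " bad");
    }
}
```

The point of keeping the reason string alongside the line is that downstream
consumers (a dead-letter file, a metrics sink) can distinguish malformed rows
from type errors, which a single failure counter cannot.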
> Improve the CSV reading process
> -------------------------------
>
> Key: FLINK-10684
> URL: https://issues.apache.org/jira/browse/FLINK-10684
> Project: Flink
> Issue Type: Improvement
> Components: API / DataSet
> Reporter: Xingcan Cui
> Priority: Major
>
> CSV is one of the most commonly used file formats in data wrangling. To load
> records from CSV files, Flink has provided the basic {{CsvInputFormat}}, as
> well as some variants (e.g., {{RowCsvInputFormat}} and
> {{PojoCsvInputFormat}}). However, it seems that the reading process can be
> improved. For example, we could add a built-in util to automatically infer
> schemas from CSV headers and samples of data. Also, the current bad-record
> handling could be improved by keeping the invalid lines (and even the
> reasons why parsing failed), instead of only logging their total number.
> This is an umbrella issue for all the improvements and bug fixes for the CSV
> reading process.
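The schema inference mentioned in the description can be approximated by a
simple widening pass over sampled rows: start each column at the narrowest
type and widen it whenever a sample value does not fit. The sketch below is a
hypothetical JDK-only illustration (three types only, no quoting or locale
handling), not an existing Flink utility.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of CSV schema inference from sampled rows:
// each column gets the narrowest type that accepts all of its samples.
public class CsvSchemaInference {

    public enum ColType { LONG, DOUBLE, STRING }

    // Widen a column's type until every sample value fits.
    static ColType inferColumn(List<String> samples) {
        ColType t = ColType.LONG;
        for (String v : samples) {
            if (t == ColType.LONG && !v.matches("-?\\d+")) {
                t = ColType.DOUBLE;
            }
            if (t == ColType.DOUBLE && !v.matches("-?\\d+(\\.\\d+)?")) {
                t = ColType.STRING;
            }
        }
        return t;
    }

    // Infer one type per column from sampled rows (already split on commas).
    public static ColType[] inferSchema(List<String[]> rows) {
        int width = rows.get(0).length;
        ColType[] schema = new ColType[width];
        for (int c = 0; c < width; c++) {
            List<String> column = new ArrayList<>();
            for (String[] row : rows) {
                column.add(row[c]);
            }
            schema[c] = inferColumn(column);
        }
        return schema;
    }
}
```

In a streaming setting the interesting part is obtaining the sample itself
(the "stream sampling" mechanism from the comment above); once a bounded
sample exists, a widening pass like this is enough to propose a schema.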
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)