[
https://issues.apache.org/jira/browse/FLINK-10684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812514#comment-16812514
]
Xingcan Cui commented on FLINK-10684:
-------------------------------------
Hi all, thanks for your attention. I ran into several problems while reading
CSV files for data wrangling, which is why I filed this ticket.
As shown in the description, the problems come from different aspects:
P1. Lack of schema inference.
P2. Weak error handling.
P3. Other implicit bugs (or the standards incompatibilities mentioned by
[~fhueske]).
Since the basic {{CsvInputFormat}} is used in both the streaming and batch
environments, some solutions to these problems may be tricky. For instance, to
automatically infer schemas, we would need to introduce a new mechanism, stream
sampling, which I believe would also benefit other processes such as automatic
parallelism tuning and streaming SQL optimization.
To solve these problems in my own project, I applied some workarounds that are
not general enough. Although I do have some general ideas (e.g., using side
outputs for bad records), considering that the Flink project has been adopting
some major changes recently, it may be better to propose them after those
changes settle.
All in all, I personally don't think this is a good time to concentrate on this
issue, since none of the solutions are trivial. What do you think?
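The "side output for bad records" idea can be sketched without any Flink
dependency: instead of only counting parse failures, route every raw line either
to a parsed-records collection or to a bad-records collection that carries the
failure reason. This is a minimal JDK-only illustration of the routing logic;
the class and method names are made up for the example and are not Flink API
(in Flink itself this would live in a {{ProcessFunction}} emitting to an
{{OutputTag}}).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Flink API: split CSV lines into good records
// and bad records, keeping the invalid line plus the reason it failed.
public class CsvBadRecordDemo {

    // A rejected input line together with the reason parsing failed.
    public static final class BadRecord {
        public final String line;
        public final String reason;
        BadRecord(String line, String reason) {
            this.line = line;
            this.reason = reason;
        }
    }

    // Parse one line against a hypothetical (long, long) schema.
    // Valid rows are appended to `good`; failures go to `bad` with a reason,
    // mirroring what a side output would carry.
    public static void route(String line, List<long[]> good, List<BadRecord> bad) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            bad.add(new BadRecord(line, "expected 2 fields, got " + fields.length));
            return;
        }
        try {
            good.add(new long[] {
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim())
            });
        } catch (NumberFormatException e) {
            bad.add(new BadRecord(line, "not a number: " + e.getMessage()));
        }
    }

    public static void main(String[] args) {
        List<long[]> good = new ArrayList<>();
        List<BadRecord> bad = new ArrayList<>();
        for (String line : List.of("1,2", "3,x", "4")) {
            route(line, good, bad);
        }
        // "1,2" parses; "3,x" fails on the number; "4" has too few fields.
        System.out.println(good.size() + " good, " + bad.size() + " bad");
    }
}
```

The point of keeping the reason string alongside the line is that downstream
consumers (a dead-letter file, a metrics sink) can distinguish malformed rows
from type errors, which a single failure counter cannot.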
> Improve the CSV reading process
> -------------------------------
>
> Key: FLINK-10684
> URL: https://issues.apache.org/jira/browse/FLINK-10684
> Project: Flink
> Issue Type: Improvement
> Components: API / DataSet
> Reporter: Xingcan Cui
> Priority: Major
>
> CSV is one of the most commonly used file formats in data wrangling. To load
> records from CSV files, Flink has provided the basic {{CsvInputFormat}}, as
> well as some variants (e.g., {{RowCsvInputFormat}} and
> {{PojoCsvInputFormat}}). However, it seems that the reading process can be
> improved. For example, we could add a built-in util to automatically infer
> schemas from CSV headers and samples of data. Also, the current bad-record
> handling could be improved by keeping the invalid lines (and even the
> reasons why parsing failed), instead of only logging their total number.
> This is an umbrella issue for all the improvements and bug fixes for the CSV
> reading process.
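The schema inference mentioned in the description can be approximated by a
simple widening pass over sampled rows: start each column at the narrowest
type and widen it whenever a sample value does not fit. The sketch below is a
hypothetical JDK-only illustration (three types only, no quoting or locale
handling), not an existing Flink utility.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of CSV schema inference from sampled rows:
// each column gets the narrowest type that accepts all of its samples.
public class CsvSchemaInference {

    public enum ColType { LONG, DOUBLE, STRING }

    // Widen a column's type until every sample value fits.
    static ColType inferColumn(List<String> samples) {
        ColType t = ColType.LONG;
        for (String v : samples) {
            if (t == ColType.LONG && !v.matches("-?\\d+")) {
                t = ColType.DOUBLE;
            }
            if (t == ColType.DOUBLE && !v.matches("-?\\d+(\\.\\d+)?")) {
                t = ColType.STRING;
            }
        }
        return t;
    }

    // Infer one type per column from sampled rows (already split on commas).
    public static ColType[] inferSchema(List<String[]> rows) {
        int width = rows.get(0).length;
        ColType[] schema = new ColType[width];
        for (int c = 0; c < width; c++) {
            List<String> column = new ArrayList<>();
            for (String[] row : rows) {
                column.add(row[c]);
            }
            schema[c] = inferColumn(column);
        }
        return schema;
    }
}
```

In a streaming setting the interesting part is obtaining the sample itself
(the "stream sampling" mechanism from the comment above); once a bounded
sample exists, a widening pass like this is enough to propose a schema.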
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)