Re: Flink CSV parsing

2017-03-11 Thread Alexander Alexandrov
FYI, I recently revisited state-of-the-art CSV parsing libraries for Emma.

I think this benchmark comparison might be useful

https://github.com/uniVocity/csv-parsers-comparison

The uniVocity parsers library seems to be dominating the benchmarks and is
feature complete.

As far as I can tell, uniVocity is currently also the only library
backing Spark's DataFrame / Dataset CSV support, which for some time
supported multiple parsing backends.

Regards,
Alexander

On Fri, Mar 10, 2017 at 11:17 PM Flavio Pompermaier wrote:

> If you already have an idea of how to proceed, maybe I can try to take care
> of it and issue a PR using commons-csv or whatever library you prefer
>
> On 10 Mar 2017 22:07, "Fabian Hueske"  wrote:
>
> Hi Flavio,
>
> Flink's CsvInputFormat was originally meant to be an efficient way to parse
> structured text files and dates back to the very early days of the project
> (probably 2011 or so).
> It was never meant to be compliant with the RFC specification and initially
> didn't support features such as quoting, quote escaping, etc. Some of
> these were added later, but others were not.
>
> I agree that the requirements for the CsvInputFormat have changed as more
> people are using the project and that a standard compliant parser would be
> desirable.
> We could definitely look into using an existing library for the parsing,
> but it would still need to be integrated with the way that Flink's
> InputFormats work. For instance, your approach isn't standard compliant
> either, because TextInputFormat is not aware of quotes and would break
> records with quoted record delimiters (FLINK-6016 [1]).
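
The problem with quote-unaware record splitting can be sketched in a few lines of plain Java. This is a hypothetical illustration, not Flink or commons-csv code, and it ignores quote escaping for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the FLINK-6016 point: splitting on '\n' tears apart RFC 4180
// records whose quoted fields contain the record delimiter.
public class QuotedDelimiterDemo {
    // Split CSV input into records, treating '\n' inside double quotes as data.
    static List<String> splitRecords(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : input.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes; // toggle quoted state (escaped quotes ignored here)
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) records.add(current.toString());
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n";
        // Naive, line-based splitting (what a plain line-oriented format effectively does):
        System.out.println(csv.split("\n").length);   // 3 "records" -- the quoted field is torn apart
        // Quote-aware splitting keeps the record intact:
        System.out.println(splitRecords(csv).size()); // 2 records: header + one data row
    }
}
```

The point being that a standard-compliant format cannot delimit records before it understands quoting, so line-based splitting alone is not enough.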
>
> I would be OK with having a less efficient format which is not based on the
> current implementation but which is standard compliant.
> IMO that would be a very useful contribution.
>
> Best, Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-6016
>
>
>
>
>
> 2017-03-10 11:28 GMT+01:00 Flavio Pompermaier :
>
> > Hi to all,
> > I want to discuss something about CSV parsing with the dev group.
> > Since I started using Flink with CSVs I have always faced little problems
> > here and there, and the new tickets about CSV parsing seem to confirm
> > that this part is still problematic.
> > In my production jobs I gave up using Flink CSV parsing in favour of
> > Apache commons-csv and it works great. It's perfectly configurable and robust.
> > A working example is available at [1].
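
For context, the quote-escaping rule an RFC-compliant parser has to handle (a doubled quote `""` inside a quoted field means a literal `"`) is the kind of detail a library like commons-csv covers out of the box. The sketch below is illustrative plain Java only, not the commons-csv implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of RFC 4180-style field parsing for a single record:
// quoted fields may contain commas, and "" inside quotes is a literal quote.
public class FieldParseDemo {
    static List<String> parseRecord(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // escaped quote: "" -> "
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // delimiters inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // A quoted field containing an escaped quote and an embedded comma:
        System.out.println(parseRecord("1,\"say \"\"hi\"\", ok\",3"));
        // -> [1, say "hi", ok, 3]
    }
}
```

Hand-rolling (and testing) all of these corner cases is exactly the effort a mature parsing library saves.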
> >
> > Thus, why not use that library directly and contribute back (if needed)
> > to that Apache library if improvements are required to speed up the
> > parsing? Have you ever tried to compare the performance of the two parsers?
> >
> > Best,
> > Flavio
> >
> > [1]
> > https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
> >
>

