I don't think that multi-char field delimiters would cause a performance problem. The data needs to be parsed anyway. Only in cases where the delimiter has a prefix that occurs often in the regular data could it have a major impact.
Fabian

2014-10-15 16:07 GMT+02:00 Martin Neumann <[email protected]>:

> Would changing it cost performance?
> If not, I think it would be a good change to make, since it allows to (ab)use
> the csv reader to load structured text files (for example by putting
> keywords as delimiters).
>
> Being able to put a regular expression there would be even nicer, but maybe
> it should end up in its own InputFormat then.
>
> cheers Martin
>
> On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <[email protected]> wrote:
>
> > Hi!
> >
> > The reason is the current way the csv parsers work. They are pushed into
> > the byte stream parsing and are restricted to recognizing one-char
> > delimiters. It is possible to change that, but it would be a bit of work.
> >
> > Stephan
> >
> > On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <[email protected]>
> > wrote:
> >
> > > Hej,
> > >
> > > A lot of my inputs are csv files, so I use the CsvInputFormat a lot.
> > > What I find kind of odd is that the line delimiter is a String but the
> > > field delimiter is a Character.
> > >
> > > *see:* new CsvInputFormat<Tuple2<String,String>>(new
> > > Path(pVecPath),"\n",'\t',String.class,String.class)
> > >
> > > Is there a reason for this? I'm currently working with a file that has
> > > a more complex field delimiter, so I had to write a mapper to read from
> > > StringInputFormat.
> > >
> > > cheers Martin
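For readers hitting the same restriction: the workaround Martin describes (reading lines as plain strings and splitting them in a map function) boils down to splitting on a literal multi-character delimiter. A minimal sketch of that splitting logic is below; the class and method names are hypothetical, and the Flink plumbing around it is omitted. The key point is `Pattern.quote`, which escapes the delimiter so `String.split` treats it literally rather than as a regular expression:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MultiCharSplit {

    // Splits one input line on a multi-character field delimiter.
    // Pattern.quote escapes the delimiter so characters like '|' or '.'
    // are matched literally, not interpreted as regex metacharacters.
    // The limit of -1 keeps trailing empty fields.
    static String[] splitFields(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // Example: "||" as a two-character field delimiter.
        String line = "foo||bar||baz";
        System.out.println(Arrays.toString(splitFields(line, "||")));
    }
}
```

In a Flink job of that era, this would sit inside a map function applied to the output of a line-based text input format, emitting a tuple per line.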
