I don't think that multi-char field delimiters would cause a performance problem. The data needs to be parsed anyway. Only in cases where the delimiter has a prefix that occurs often in the regular data could it have a major impact.
Fabian

2014-10-15 16:07 GMT+02:00 Martin Neumann <[email protected]>:

> Would changing it cost performance?
> If not, I think it would be a good change to make, since it allows to (ab)use
> the csv reader to load structured text files (for example by putting
> keywords as delimiters).
>
> Being able to put a regular expression there would be even nicer, but maybe
> it should end up in its own InputFormat then.
>
> cheers Martin
>
> On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <[email protected]> wrote:
>
> > Hi!
> >
> > The reason is the current way the csv parsers work. They are pushed into
> > the byte stream parsing and are restricted to recognizing one-char
> > delimiters. It is possible to change that, but it would be a bit of work.
> >
> > Stephan
> >
> > On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <[email protected]>
> > wrote:
> >
> > > Hej,
> > >
> > > A lot of my inputs are csv files, so I use the CsvInputFormat a lot.
> > > What I find kind of odd is that the line delimiter is a String but the
> > > field delimiter is a Character.
> > >
> > > *see:* new CsvInputFormat<Tuple2<String,String>>(new
> > > Path(pVecPath),"\n",'\t',String.class,String.class)
> > >
> > > Is there a reason for this? I'm currently working with a file that has
> > > a more complex field delimiter, so I had to write a mapper to read from
> > > StringInputFormat.
> > >
> > > cheers Martin
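For readers hitting the same restriction: the workaround Martin describes (reading lines as plain strings and splitting them in a map function) boils down to splitting on a literal multi-character delimiter. A minimal sketch of that splitting logic is below; the class and method names are hypothetical, and the Flink plumbing around it is omitted. The key point is `Pattern.quote`, which escapes the delimiter so `String.split` treats it literally rather than as a regular expression:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MultiCharSplit {

    // Splits one input line on a multi-character field delimiter.
    // Pattern.quote escapes the delimiter so characters like '|' or '.'
    // are matched literally, not interpreted as regex metacharacters.
    // The limit of -1 keeps trailing empty fields.
    static String[] splitFields(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // Example: "||" as a two-character field delimiter.
        String line = "foo||bar||baz";
        System.out.println(Arrays.toString(splitFields(line, "||")));
    }
}
```

In a Flink job of that era, this would sit inside a map function applied to the output of a line-based text input format, emitting a tuple per line.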
