Thank you for sharing your experience, Ted. Do you have a link to the code of your parser? I'd like to take a look.

Currently the data flow in Commons CSV is:

1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a String
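The three steps above could be sketched roughly as follows. This is an illustrative sketch, not the actual Commons CSV code; the class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch of the current flow: read from a BufferedReader (step 1),
// accumulate characters in a reusable buffer (step 2), copy into a String (step 3).
class TokenSketch {
    private final StringBuilder tokenBuffer = new StringBuilder(); // reused across tokens

    /** Returns the next comma/newline-delimited token, or null at end of input. */
    String nextToken(BufferedReader in) {
        tokenBuffer.setLength(0);                // step 2: reuse the same buffer
        int c = -1;
        try {
            while ((c = in.read()) != -1 && c != ',' && c != '\n') {
                tokenBuffer.append((char) c);    // step 2: accumulate the token
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // step 3: every token pays for a String copy, even if never used
        return (tokenBuffer.length() == 0 && c == -1) ? null : tokenBuffer.toString();
    }
}
```

The `toString()` call in step 3 is the copy the discussion below tries to avoid.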

I was also thinking of something similar to reduce the number of String copies. The token produced by the CSVLexer could probably contain a CharSequence instead of a String. The CharSequence would be backed by the same array for all the fields of the record; thus, if a field isn't read by the user, we don't pay the cost of converting it into a String. But this prevents reusing the buffer, and that means more work for the GC.
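Such a CharSequence-backed token could look something like the sketch below; the class name is hypothetical, and this ignores details like quoting and escaping. All fields of a record share one char[], and the String copy happens only if the caller asks for it:

```java
// Hypothetical sketch: a field exposed as a CharSequence view over the
// record's shared char[] buffer. No String is built unless toString() is called.
final class FieldView implements CharSequence {
    private final char[] recordBuffer; // shared by every field of the record
    private final int offset;
    private final int length;

    FieldView(char[] recordBuffer, int offset, int length) {
        this.recordBuffer = recordBuffer;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }
    @Override public char charAt(int i) { return recordBuffer[offset + i]; }
    @Override public CharSequence subSequence(int start, int end) {
        return new FieldView(recordBuffer, offset + start, end - start);
    }
    @Override public String toString() {
        // the String copy is deferred until the user actually reads the field
        return new String(recordBuffer, offset, length);
    }
}
```

The trade-off is the one mentioned above: because the views keep the record buffer alive, the parser can't reuse it for the next record without invalidating them.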

Emmanuel Bourg


On 15/03/2012 15:49, Ted Dunning wrote:
I built a limited CSV package for parsing data in Mahout at one point.  I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was avoiding copies
and conversions to String. To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays. The parser passed
around offsets only. Then when converting data, I converted directly from
the original byte array into the target type. For the most common case (in
my data) of converting to Integers, this eliminated masses of cons'ing, and
because the conversion was special-purpose (I assumed UTF-8 encoding and
that numbers could only use ASCII-range digits), the conversion to
integers was particularly fast.

Overall, this made about a 20x difference in speed.  This is not 20%; the
final time was 5% of the original.
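The conversion Ted describes could be sketched along these lines: parse an int directly from a byte[] slice, assuming ASCII digits, with no intermediate String. This is an assumed reconstruction for illustration, not his actual Mahout code:

```java
// Sketch of a direct byte[]-to-int conversion: the parser passes only
// (offset, end) into the shared byte array, so no String is ever created.
// Assumes ASCII-range digits, as in Ted's description.
final class ByteFieldParser {
    static int parseInt(byte[] data, int offset, int end) {
        int i = offset;
        boolean negative = data[i] == '-';
        if (negative) {
            i++;
        }
        int value = 0;
        for (; i < end; i++) {
            byte b = data[i];
            if (b < '0' || b > '9') {
                throw new NumberFormatException("non-digit byte at index " + i);
            }
            value = value * 10 + (b - '0'); // ASCII digit to numeric value
        }
        return negative ? -value : value;
    }
}
```

Skipping `new String(...)` plus `Integer.parseInt(...)` per field is where the allocation savings come from in this style of parser.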
