See the Line and FastLine classes in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout Examples module.
You can see an older version of Mahout here. This class hasn't changed in forever.

https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahout/classifier/sgd/SimpleCsvExamples.java

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg <ebo...@apache.org> wrote:

> Thank you for sharing your experience Ted. Do you have a link to the code
> of your parser? I'd like to take a look.
>
> Currently the data flow in Commons CSV is:
>
> 1. Buffer the data in the BufferedReader
> 2. Accumulate data in a reusable buffer for the current token
> 3. Turn the token buffer into a String
>
> I was also thinking of something similar to reduce the string copies. The
> token from the CSVLexer could probably contain a CharSequence instead of a
> String. The CharSequence would be backed by the same array for all the
> fields of the record. Thus if a field isn't read by the user we don't pay
> the cost to convert it into a String. But this prevents the reuse of the
> buffer, and that means more work for the GC.
>
> Emmanuel Bourg
>
>
> On 15/03/2012 15:49, Ted Dunning wrote:
>
>> I built a limited CSV package for parsing data in Mahout at one point. I
>> doubt that it was general enough to be helpful here, but the experience
>> might be.
>>
>> The thing that *really* made a big difference in speed was to avoid copies
>> and conversions to String. To do that, I built a state machine that
>> operated on bytes to do the parsing from byte arrays. The parser passed
>> around offsets only. Then when converting data, I converted directly from
>> the original byte array into the target type. For the most common case (in
>> my data) of converting to Integers, this eliminated masses of cons'ing and
>> because the conversion was special purpose (I assumed UTF8 encoding and
>> assumed that numbers could only use ASCII range digits), the conversion to
>> integers was particularly fast.
>>
>> Overall, this made about a 20x difference in speed. This is not 20%; the
>> final time was 5% of the original.
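
For anyone who wants the idea without reading the Mahout source, here is a rough sketch of the offsets-only approach described in the quoted message. It is not the FastLine code. The class name ByteCsvSketch and its methods are made up for illustration, and it assumes comma-separated fields with no quoting or escaping, ASCII digits only, and no overflow checking.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not the Mahout FastLine implementation. */
final class ByteCsvSketch {

    /**
     * Records the start offset of each comma-separated field in buf[0..len).
     * Returns the field count. starts[count] is set to len + 1 as a sentinel,
     * so field i always occupies buf[starts[i] .. starts[i + 1] - 1).
     * Assumes no quoting or escaping, and that starts has room for count + 1 entries.
     */
    static int splitFields(byte[] buf, int len, int[] starts) {
        int count = 0;
        starts[count++] = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == ',') {
                starts[count++] = i + 1;
            }
        }
        starts[count] = len + 1;
        return count;
    }

    /**
     * Parses a decimal integer straight from buf[start..end) without creating a String.
     * Assumes ASCII digits with an optional leading '-'; overflow is not checked.
     */
    static int parseInt(byte[] buf, int start, int end) {
        if (start >= end) {
            throw new NumberFormatException("empty field");
        }
        boolean negative = buf[start] == '-';
        int i = negative ? start + 1 : start;
        if (i >= end) {
            throw new NumberFormatException("sign without digits");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
        }
        return negative ? -value : value;
    }

    /** Converts field number 'field' to an int using only the recorded offsets. */
    static int intField(byte[] buf, int[] starts, int field) {
        return parseInt(buf, starts[field], starts[field + 1] - 1);
    }

    public static void main(String[] args) {
        byte[] line = "12,-7,300".getBytes(StandardCharsets.US_ASCII);
        int[] starts = new int[16]; // reused across lines; no per-field allocation
        int n = splitFields(line, line.length, starts);
        long sum = 0;
        for (int f = 0; f < n; f++) {
            sum += intField(line, starts, f);
        }
        System.out.println(n + " fields, sum = " + sum); // prints "3 fields, sum = 305"
    }
}
```

The point of the sketch is that the byte array and the offsets array are both reused from line to line, and a field that is never read is never converted.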