Hi everyone,

For one of our applications using MetaModel we have a customer with
quite large files (100+ million records per file), and reading through
them takes quite some time, even though the CSV module is, as far as we
know, one of the fastest modules in MetaModel.

But these particular files (and probably many others) have a
characteristic that I see we could utilize to make an optimization:
They don't allow values that span multiple lines. For instance
consider:

name,company
Kasper Sørensen, Human Inference
Ankit Kumar, Human Inference

This is a rather normal CSV layout. But our CSV parser also allows
multiline values (if quoted), like this:

"name","company"
"Kasper Sørensen","Human
Inference"
"Ankit Kumar","Human Inference"

Now the optimization I had in mind is to delay the actual parsing of
lines until the point where a value is needed. But this won't work with
multiline values, since we wouldn't know whether to hand a single line
or multiple lines to the delayed/lazy CSV parser. So today the module
is slowed down by a blocking CSV parsing operation for each row.
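
To illustrate the multiline problem, here is a minimal sketch (not
MetaModel code) that checks whether a physical line ends inside a
quoted value. The class and method names are made up for this example,
and it assumes quotes are escaped by doubling ("") rather than with a
backslash:

```java
public class QuoteParitySketch {

    /**
     * Returns true if the line ends while a quoted value is still open,
     * i.e. it contains an odd number of double-quote characters.
     * Assumption: quotes inside values are escaped by doubling them.
     */
    static boolean hasUnclosedQuote(String line) {
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                inQuotes = !inQuotes;
            }
        }
        return inQuotes;
    }

    public static void main(String[] args) {
        // First physical line of a record whose last value continues on the next line.
        System.out.println(hasUnclosedQuote("\"Ankit Kumar\",\"Human"));
        // A complete single-line record.
        System.out.println(hasUnclosedQuote("\"Ankit Kumar\",\"Human Inference\""));
    }
}
```

A physical line on its own cannot tell us where its record ends, so the
reader would have to scan every line for quotes anyway, which is
exactly the per-row work we want to postpone.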

But if we add a flag that lets the user declare that only single-line
values are expected/accepted, then we can simply read through the file
with something like a BufferedReader and return Row objects that
encapsulate the raw String line. The parsing of this line is then
delayed and can potentially be made multithreaded.
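
A rough sketch of that idea, to make the discussion concrete. This is
not the code from the patch: LazyRow and readRows are hypothetical
names, the split is a naive comma split that ignores quoting, and the
rows are collected in a list only to keep the example short (a real
reader would stream them):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LazyCsvSketch {

    /** Hypothetical row type: keeps the raw line and splits it on first access. */
    static final class LazyRow {
        private final String rawLine;
        private volatile String[] values; // stays null until a value is requested

        LazyRow(String rawLine) {
            this.rawLine = rawLine;
        }

        String getValue(int index) {
            String[] v = values;
            if (v == null) {
                // Naive split; a real parser would honour quotes and escapes.
                // Two threads may both parse here, which is harmless but redundant.
                v = rawLine.split(",", -1);
                values = v;
            }
            return v[index];
        }

        boolean isParsed() {
            return values != null;
        }
    }

    /** Reads physical lines only; no CSV parsing happens here. */
    static List<LazyRow> readRows(BufferedReader reader) throws IOException {
        List<LazyRow> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            rows.add(new LazyRow(line));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,company\nAnkit Kumar,Human Inference\n";
        List<LazyRow> rows = readRows(new BufferedReader(new StringReader(csv)));
        LazyRow row = rows.get(1);
        System.out.println(row.isParsed());  // nothing parsed yet
        System.out.println(row.getValue(1)); // parsing happens on this call
    }
}
```

Because each row carries its own raw line, whichever consumer thread
asks for a value first does the parsing, so the reading thread never
blocks on it.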

I made a quick prototype patch [1] of this idea (still a few
improvements to be made), and my quick'n'dirty tests showed up to a
~65% performance increase in a multithreaded consumer environment!

I did three runs before and after the improvements on a 30k record
file. The results are the number of milliseconds used for reading
through all the values of the file:

        // results with old impl: [13908, 13827, 14577]. Total= 42312

        // results with new impl: [8567, 8965, 8154]. Total= 25686
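
For reference, this is how the totals above translate into
percentages. The small helper class is only for illustration and its
names are made up; the timings are the ones listed above:

```java
public class SpeedupSketch {

    static long total(long[] runsMillis) {
        long sum = 0;
        for (long run : runsMillis) {
            sum += run;
        }
        return sum;
    }

    /** Extra work done per unit of time, in percent (old time / new time - 1). */
    static double throughputGainPercent(long oldTotal, long newTotal) {
        return (oldTotal / (double) newTotal - 1.0) * 100.0;
    }

    /** Wall-clock time saved, in percent (1 - new time / old time). */
    static double timeReductionPercent(long oldTotal, long newTotal) {
        return (1.0 - newTotal / (double) oldTotal) * 100.0;
    }

    public static void main(String[] args) {
        long oldTotal = total(new long[] {13908, 13827, 14577});
        long newTotal = total(new long[] {8567, 8965, 8154});
        System.out.printf("throughput gain: %.1f%%%n", throughputGainPercent(oldTotal, newTotal));
        System.out.printf("time reduction:  %.1f%%%n", timeReductionPercent(oldTotal, newTotal));
    }
}
```

So the ~65% figure is a throughput gain over the three runs combined;
expressed as time saved, it is roughly 39%.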

The test that I ran is the class called 'CsvBigFileMemoryTest.java'.

What do you guys think? Is it feasible to make an optimization like
this for a specific type of CSV file?

[1] https://gist.github.com/kaspersorensen/6087230
