Hi Hernán,

On 06 Dec 2010, at 20:54, Hernán Morales Durand wrote:

> It seems my performance problem involves reading and parsing a "CSV" file
> 
> Elements  Matrix  DhbMatrix
>    53400   18274      17329
>   175960   61043      60722
>   710500  379276     385278
> 710500        379276  385278
> 
> I will check whether it is worth implementing a primitive for very fast
> parsing of CSV files.

I think that instead of going native, it would be worthwhile to try to optimize 
in Smalltalk first (it certainly is more fun). 

I thought that this was an interesting problem so I tried writing some code 
myself, assuming the main problem is getting a CSV matrix in and out of 
Smalltalk as fast as possible. I simplified further by making the matrix square 
and containing only Numbers. I also preallocate the matrix and use the fact 
that I know the row/column sizes.

These are my results:

Size  Elements  Read  Write
 250     62500  1013   7858
 500    250000  4185  31007
 750    562500  9858  71434

I think this is faster, but it is hard to compare. I am still a bit puzzled as 
to why the writing is slower than the reading though.

The code is available at http://www.squeaksource.com/ADayAtTheBeach.html in the 
package 'Smalltalk-Hacking-Sven', class NumberMatrix.

This is the write loop:

writeCsvTo: stream
        1 to: size do: [ :row |
                1 to: size do: [ :column |
                        column ~= 1 ifTrue: [ stream nextPut: $, ].
                        stream print: (self at: row at: column) ].
                stream nextPut: Character lf ]

And this is the read loop:

readCsvFrom: stream
        | numberParser |
        numberParser := SqNumberParser on: stream.
        1 to: size do: [ :row |
                1 to: size do: [ :column |
                        self at: row at: column put: numberParser nextNumber.
                        column ~= size ifTrue: [ stream peekFor: $, ] ].
                row ~= size ifTrue: [ stream peekFor: Character lf ] ]
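
For reference, a round trip through these two methods might look like the sketch below. The `NumberMatrix new:` instantiation and the way I fill the matrix are my assumptions about the class's protocol, not something taken from the package itself:

```smalltalk
"Round-trip sketch: write a small matrix to a stream, then read it back.
 Assumes 'NumberMatrix new: 3' answers a preallocated 3x3 matrix."
| matrix out copy |
matrix := NumberMatrix new: 3.
1 to: 3 do: [ :row |
	1 to: 3 do: [ :column |
		matrix at: row at: column put: row * 10 + column ] ].
out := WriteStream on: String new.
matrix writeCsvTo: out.
copy := NumberMatrix new: 3.
copy readCsvFrom: (ReadStream on: out contents)
```

After this, `copy` should hold the same numbers as `matrix`.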

I am of course cheating a little bit, but should your CSV file be different, I 
am sure you can adapt the code (for example, to deal with quoting). I am also 
advancing the stream underneath the SqNumberParser to avoid allocating a new 
parser every time. I think this code generates very little garbage.
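
To reproduce timings like the table above, one could wrap the two calls in `Time millisecondsToRun:`. This is only a sketch; the `NumberMatrix new:` protocol and the stream setup are assumptions on my part:

```smalltalk
"Timing sketch for one matrix size, using Squeak's Time>>millisecondsToRun:.
 Assumes 'NumberMatrix new: 250' answers a preallocated 250x250 matrix."
| matrix out in writeMs readMs |
matrix := NumberMatrix new: 250.
1 to: 250 do: [ :row |
	1 to: 250 do: [ :column |
		matrix at: row at: column put: row + column ] ].
out := WriteStream on: String new.
writeMs := Time millisecondsToRun: [ matrix writeCsvTo: out ].
in := ReadStream on: out contents.
readMs := Time millisecondsToRun: [ matrix readCsvFrom: in ].
Transcript show: 'read: ', readMs printString, ' write: ', writeMs printString; cr
```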

What do you think ?

Sven
