Hi Sven,

Thanks for your comments. My matrix includes names, floats and
characters, and it isn't square because I need to build a transposed
matrix while reading the CSV. However, I will take a look at
SqNumberParser.
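
For what it's worth, with mixed cells like these one way is to convert a
cell to a number when the whole cell parses as one and keep it as a string
otherwise. A minimal sketch (the selector name is made up for the example):

convertCell: aString
        "Answer a Number when the whole cell parses as one, otherwise keep the String."
        | stream |
        stream := aString readStream.
        ^ [ | value |
                value := Number readFrom: stream.
                (value notNil and: [ stream atEnd ])
                        ifTrue: [ value ]
                        ifFalse: [ aString ] ]
                on: Error
                do: [ :ex | aString ]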

In case anyone is wondering how this is done, here is the code for a
matrix with 20 rows:

First, I make some predefined columns whose values simply repeat, so I
used the Generator ported by Lukas Renggli.

rowCount := 29044.
colCount := 20.
matrix := Matrix rows: rowCount columns: colCount.

"Column 1 holds a repeated constant, built with the Generator."
firstCol := (Generator on: [ :g | rowCount timesRepeat: [ g yield: 1 ] ]) upToEnd.
matrix atColumn: 1 put: firstCol.
"... fill remaining 5 columns ..."

"Fill the remaining cells, transposing while walking the parsed CSV."
index := 0.
parserResult rowsDo: [ :rs |
        matrix
                at: index \\ 20 + 1
                at: index // 20 + 7
                put: 'test'.
        index := index + 1 ].

"Write the matrix out as tab-separated values."
1 to: rowCount do: [ :rIndex |
        1 to: colCount do: [ :cIndex |
                outputFile nextPutAll: (matrix at: rIndex at: cIndex) asString; tab ].
        outputFile cr ].
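
The arithmetic on index in the fill loop is what does the transposing here:
twenty consecutive values walk down rows 1 to 20 of one matrix column, and
every 20th value moves on to the next column (starting at column 7). A tiny
check, for illustration only:

#(0 1 19 20 21 40) do: [ :index |
        Transcript
                show: 'value ', index printString,
                        ' -> row ', (index \\ 20 + 1) printString,
                        ', column ', (index // 20 + 7) printString;
                cr ]

For the constant columns, Array new: rowCount withAll: 1 would give the same
result as the Generator, but the Generator generalizes nicely to repeating
patterns that are more than a single value.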

Hope it helps someone,
Cheers

2010/12/7 Sven Van Caekenberghe <[email protected]>:
> Hi Hernán,
> On 06 Dec 2010, at 20:54, Hernán Morales Durand wrote:
>
> It seems my performance problem involves reading and parsing a "CSV" file
>
> Elements  Matrix  DhbMatrix
>    53400   18274      17329
>   175960   61043      60722
>   710500  379276     385278
>
> I will check if it's worth implementing a primitive for very fast
> parsing of CSV files.
>
> I think that instead of going native, it would be worthwhile to try to
> optimize in Smalltalk first (it certainly is more fun).
> I thought that this was an interesting problem so I tried writing some code
> myself, assuming the main problem is getting a CSV matrix in and out of
> Smalltalk as fast as possible. I simplified further by making the matrix
> square and containing only Numbers. I also preallocate the matrix and use
> the fact that I know the row/column sizes.
> These are my results:
> Size  Elements  Read  Write
>  250     62500  1013   7858
>  500    250000  4185  31007
>  750    562500  9858  71434
> I think this is faster, but it is hard to compare. I am still a bit puzzled
> as to why the writing is slower than the reading though.
> The code is available at http://www.squeaksource.com/ADayAtTheBeach.html in
> the package 'Smalltalk-Hacking-Sven', class NumberMatrix.
> This is the write loop:
> writeCsvTo: stream
>     1 to: size do: [ :row |
>         1 to: size do: [ :column |
>             column ~= 1 ifTrue: [ stream nextPut: $, ].
>             stream print: (self at: row at: column) ].
>         stream nextPut: Character lf ]
> And this is the read loop:
> readCsvFrom: stream
>     | numberParser |
>     numberParser := SqNumberParser on: stream.
>     1 to: size do: [ :row |
>         1 to: size do: [ :column |
>             self at: row at: column put: numberParser nextNumber.
>             column ~= size ifTrue: [ stream peekFor: $, ] ].
>         row ~= size ifTrue: [ stream peekFor: Character lf ] ]
> I am of course cheating a little bit, but should your CSV file be different,
> I am sure you can adapt the code (for example to deal with quoting). I am
> also advancing the stream under the SqNumberParser to avoid allocating a new
> one every time. I think this code generates little garbage.
> What do you think ?
> Sven
>



-- 
Hernán Morales
Information Technology Manager,
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.
