Hi Benoit,

Thanks for your suggestions! I ran the measurements again on a subset of
my files with fewer lines, and here are the results:

Lines    Milliseconds
   116             59
  2784           1415
 63936          18675
175840          48216
534760         149812

After implementing your improvements:
Lines    Milliseconds
   116             44
  2784           1067
 63936          13991
175840          37521
534760         112906
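
For anyone following along, the class-variable caching could be set up
roughly like this. It is only a sketch: it assumes the class variables are
named MyCr and MyLf as in your example, that they are declared in the
STextParser class definition, and that the method lives on the class side.
The package itself may do it differently.

```smalltalk
initialize
    "Class-side method (sketch). Cache the line-end characters once so that
     #nextInLine can compare against them with #== instead of sending
     Character cr / Character lf on every call.
     MyCr and MyLf are assumed to be declared as class variables."
    MyCr := Character cr.
    MyLf := Character lf
```

Evaluating "STextParser initialize" once after loading the code fills the
two variables. The #== comparison relies on the same Character instance
being returned for cr and lf every time, which is the case for these
byte characters.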

Since I can assume my CSV files don't include quotes, I've removed
the quoted-character checking from the parser and moved it into a special
subclass for files with quotes:
Lines    Milliseconds
   116             37
  2784            888
 63936          11521
175840          31905
534760          96549

which is an acceptable improvement for files with millions of
lines. If I can find some time to implement a primitive, I will
update with more results.
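
To illustrate what parsing without quote checking amounts to, here is a
minimal workspace sketch. It is not the code from the package: the sample
row is made up, and it assumes a comma delimiter and that no field contains
a quote or an embedded comma.

```smalltalk
| line stream fields |
line := 'rs123,1,0.25,A'.    "hypothetical sample row, not real data"
stream := ReadStream on: line.
fields := OrderedCollection new.
"With no quotes to track, every comma is a field boundary, so each field
 is simply whatever precedes the next comma (or the end of the line)."
[stream atEnd] whileFalse: [fields add: (stream upTo: $,)].
fields
```

After evaluating this, fields holds the four strings of the row. Note that
a trailing empty field (a line ending in a comma) is dropped by this loop,
so it would need extra handling if that matters for the data.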
Best regards,

2010/12/7 Benoit St-Jean <[email protected]>:
> Hi Hernan,
>
> I had a look at the Text Parser package and here's a little suggestion...
>
> In class STextParser, there seems to be room for optimization (I know, I
> *hate* using this word with Smalltalk, I've always preferred simplicity over
> harder-to-read-code) in one particular method, namely #nextInLine.
>
> A quick test in a workspace shows that removing the message sends for
> #cr and #lf (making them class variables instead) as well as using #==
> instead of #= makes this method 3 to 5 times faster.  Since it's probably
> called millions and millions of times in your case, this might help...
>
> So it would look like:
>
> nextInLine
>     | next |
>
>     next := stream next.
>     (next == MyCr or: [next == MyLf])
>         ifTrue:    [stream skip: -1.
>                 next := nil].
>     ^ next
>
>
> The original method executes in 23.45 seconds.
> Removing the message sends (Character cr and Character lf) and replacing
> them with class vars brings that down to 10.9 seconds.
> Then, replacing #= with #== brings the average time to 4.25 seconds.
>
> Tests were executed for 10000000 characters.
>
> Hope this helps!
>
> Keep me posted!
>
>
> -----------------
> Benoit St-Jean
> A standpoint is an intellectual horizon of radius zero.
> (Albert Einstein)
>
>
>
>
>> Date: Tue, 7 Dec 2010 03:10:20 -0300
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: [Pharo-users] Fastest matrix implementation?
>>
>> I cannot send the CSV files because they are private data currently
>> being used for research. Just to give a hint, I'm parsing files of
>> 800,000 to 19 million lines.
>>
>> I'm using http://www.squeaksource.com/SimpleTextParser.html which is
>> based on http://www.squeaksource.com/CSV.html plus some useful
>> additions (for me).
>>
>> 2010/12/7 Benoit St-Jean <[email protected]>:
>> > What are you using to read those CSV files?  Do you have a file so we
>> > can have a look at it and possibly speed up the reading of the CSV file?
>> >
>> >> Date: Mon, 6 Dec 2010 16:54:37 -0300
>> >> From: [email protected]
>> >> To: [email protected]
>> >> Subject: Re: [Pharo-users] Fastest matrix implementation?
>> >>
>> >> Hi Benoit,
>> >>
>> >> I've loaded the package but it seems the port is not complete, i.e. if
>> >> you evaluate:
>> >>
>> >> DhbMatrix new: 10
>> >>
>> >> you will get a MessageNotUnderstood: Interval>>asVector because
>> >> extension methods were not ported. I uploaded to the SqueakSource a
>> >> new version including extension methods and now most tests pass.
>> >>
>> >> Concerning the performance issues, I've narrowed my code to only
>> >> measure the writing and reading of a matrix of 710500 elements,
>> >> resulting in 58239 milliseconds for the native Matrix implementation
>> >> and 56920 for DhbMatrix.
>> >> It seems my performance problem involves reading and parsing a "CSV"
>> >> file:
>> >>
>> >> Elements  Matrix  DhbMatrix
>> >>    53400   18274      17329
>> >>   175960   61043      60722
>> >>   710500  379276     385278
>> >>
>> >> I will check whether it's worth implementing a primitive for very fast
>> >> parsing of CSV files.
>> >> Cheers,
>> >>
>> >> 2010/12/5 Benoit St-Jean <[email protected]>:
>> >> > Have you tried the matrix implementation in the numerical package
>> >> > from Didier H. Besset?
>> >> >
>> >> > http://squeaksource.com/@Q45T_l348Ag07gGT/VMsGzidC
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >> Date: Sun, 5 Dec 2010 17:33:17 -0300
>> >> >> From: [email protected]
>> >> >> To: [email protected]
>> >> >> Subject: [Pharo-users] Fastest matrix implementation?
>> >> >>
>> >> >> Hi list
>> >> >>
>> >> >> In the context of a scientific project here we are building big
>> >> >> matrices for later processing, mostly exporting to custom file
>> >> >> formats for PLINK, HaploView, etc. (bioinformatics tools). I've
>> >> >> compared one of our scripts in Pharo 1.1 (not CogVM) against the
>> >> >> corresponding Python 2.6 implementation (without PyPy), and the
>> >> >> performance in Python was superior, about 8x faster than ST.
>> >> >> So I wonder if anyone knows the fastest (or a faster) implementation
>> >> >> of Matrix than the one included by default in Collections?
>> >> >>
>> >> >> Cheers,
>> >> >>
>> >>
>> >> --
>> >> Hernán Morales
>> >> Information Technology Manager,
>> >> Institute of Veterinary Genetics.
>> >> National Scientific and Technical Research Council (CONICET).
>> >> La Plata (1900), Buenos Aires, Argentina.
>> >> Telephone: +54 (0221) 421-1799.
>> >> Internal: 422
>> >> Fax: 425-7980 or 421-1799.
>> >>
>> >
>>
>>
>>
>>
>
