Hi Benoit,

Thanks for your suggestions! I ran the measurements again on a subset of
my files with fewer lines, and here are the results:

Lines     Milliseconds
116       59
2784      1415
63936     18675
175840    48216
534760    149812

After implementing your improvements:

Lines     Milliseconds
116       44
2784      1067
63936     13991
175840    37521
534760    112906

Since I can assume my CSV files don't include quotes, I've removed the
quoted-character checking, moving it into a special parser subclass for
files with quotes:

Lines     Milliseconds
116       37
2784      888
63936     11521
175840    31905
534760    96549

which is an acceptable improvement considering the files have millions of
lines. If I can find time to implement a primitive, I will post an update
with more results.

Best regards,

2010/12/7 Benoit St-Jean <[email protected]>:
> Hi Hernan,
>
> I had a look at the Text Parser package and here's a little suggestion...
>
> In class STextParser, there seems to be room for optimization (I know, I
> *hate* using this word with Smalltalk, I've always preferred simplicity over
> harder-to-read code) in one particular method, namely #nextInLine.
>
> A quick test in a workspace shows that removing the message sends for
> #cr and #lf (making them class variables instead) as well as using #==
> instead of #= makes this method 3 to 5 times faster. Since it's probably
> called millions and millions of times in your case, this might help...
>
> So it would look like:
>
> nextInLine
>     | next |
>
>     next := stream next.
>     (next == MyCr or: [next == MyLf])
>         ifTrue: [stream skip: -1.
>             next := nil].
>     ^ next
>
>
> The original method executes in 23.45 seconds.
> Removing the message sends (Character cr and Character lf) and replacing
> them with class vars brings that down to 10.9 seconds.
> Then, replacing #= with #== brings the average time to 4.25 seconds.
>
> Tests were executed for 10000000 characters.
>
> Hope this helps!
>
> Keep me posted!
>
>
> -----------------
> Benoit St-Jean
> A standpoint is an intellectual horizon of radius zero.
> (Albert Einstein)
>
>> Date: Tue, 7 Dec 2010 03:10:20 -0300
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: [Pharo-users] Fastest matrix implementation?
>>
>> I cannot send the CSV files because they are private data currently
>> being used for research. Just to give a hint, I'm parsing files of
>> 800,000 to 19 million lines.
>>
>> I'm using http://www.squeaksource.com/SimpleTextParser.html which is
>> based on http://www.squeaksource.com/CSV.html plus some useful
>> additions (for me).
>>
>> 2010/12/7 Benoit St-Jean <[email protected]>:
>> > What are you using to read those CSV files? Do you have a file so we
>> > can have a look at it and possibly speed up the reading of the CSV
>> > file?
>> >
>> >> Date: Mon, 6 Dec 2010 16:54:37 -0300
>> >> From: [email protected]
>> >> To: [email protected]
>> >> Subject: Re: [Pharo-users] Fastest matrix implementation?
>> >>
>> >> Hi Benoit,
>> >>
>> >> I've loaded the package but it seems the port is not complete, i.e. if
>> >> you evaluate:
>> >>
>> >> DhbMatrix new: 10
>> >>
>> >> you will get a MessageNotUnderstood: Interval>>asVector because the
>> >> extension methods were not ported. I uploaded a new version to
>> >> SqueakSource that includes the extension methods, and now most
>> >> tests pass.
>> >>
>> >> Concerning the performance issues, I've narrowed my code to measure
>> >> only the writing and reading of a matrix of 710500 elements,
>> >> resulting in 58239 milliseconds for the native Matrix implementation
>> >> and 56920 for DhbMatrix.
>> >> It seems my performance problem involves reading and parsing a "CSV"
>> >> file:
>> >>
>> >> Elements   Matrix    DhbMatrix
>> >> 53400      18274     17329
>> >> 175960     61043     60722
>> >> 710500     379276    385278
>> >>
>> >> I will check whether it's worth implementing a primitive for very
>> >> fast parsing of CSV files.
>> >> Cheers,
>> >>
>> >> 2010/12/5 Benoit St-Jean <[email protected]>:
>> >> > Have you tried the matrix implementation in the numerical package
>> >> > from Didier H. Besset?
>> >> >
>> >> > http://squeaksource.com/@Q45T_l348Ag07gGT/VMsGzidC
>> >> >
>> >> >> Date: Sun, 5 Dec 2010 17:33:17 -0300
>> >> >> From: [email protected]
>> >> >> To: [email protected]
>> >> >> Subject: [Pharo-users] Fastest matrix implementation?
>> >> >>
>> >> >> Hi list
>> >> >>
>> >> >> In the context of a scientific project here we are building big
>> >> >> matrices for later processing, mostly exporting to custom file
>> >> >> formats for PLINK, HaploView, etc (bioinformatics tools). I've
>> >> >> tested one of our scripts in both Pharo 1.1 (not CogVM) and the
>> >> >> corresponding Python 2.6 implementation (without PyPy), and the
>> >> >> performance in Python was superior, about 8x faster than ST.
>> >> >> So I wonder if anyone knows the fastest (or a faster)
>> >> >> implementation of Matrix than the one included by default in
>> >> >> Collections?
>> >> >>
>> >> >> Cheers,
>> >> >>
>> >>
>> >> --
>> >> Hernán Morales
>> >> Information Technology Manager,
>> >> Institute of Veterinary Genetics.
>> >> National Scientific and Technical Research Council (CONICET).
>> >> La Plata (1900), Buenos Aires, Argentina.
>> >> Telephone: +54 (0221) 421-1799.
>> >> Internal: 422
>> >> Fax: 425-7980 or 421-1799.
