On 3/11/06, Prof Brian Ripley <[EMAIL PROTECTED]> wrote:
> Here is a summary of some results on a dual Opteron 252 running FC3
>
> 64-bit gcc 3.4.5
> R's blas            34.83   3.45  38.56
> ATLAS               36.70   3.28  40.14
> ATLAS multithread   76.85   5.39  82.29
> Goto 1 thread       36.17   3.44  39.76
> Goto multithread   178.06 345.97 467.99
> ACML                49.69   3.36  53.23
>
> 64-bit gcc 4.1.0
> R's blas            34.98   3.49  38.55
> 32-bit gcc 3.4.5
> R's blas            33.72   3.27  36.99
> 32-bit gcc 4.1.0
> R's blas            34.62   3.25  37.93
>
> The timings are not that repeatable, but the message seems clear that
> this problem does not benefit from a tuned BLAS and the overhead from
> multiple threads is harmful.  (The gcc 4.1.0 results took fewer
> iterations, which skews the results in its favour.)
>
> And my 2GHz Pentium M laptop under Windows gave 39.96 3.68 44.06.
>
> Clearly the Goto BLAS has a problem here: the results are slower on a dual
> 252 than a dual 248 (see below).
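The 64-bit gcc 3.4.5 figures quoted above can be compared more easily as ratios against R's own BLAS. The following is only a sketch and not part of the original exchange: it assumes the three columns are user, system and elapsed seconds as printed by system.time(), and the data frame and column names (timings, blas, elapsed_ratio, sys_share) are labels chosen here for illustration.

```r
# Timings copied from the 64-bit gcc 3.4.5 table quoted above.
# Assumption: columns are user, system, elapsed seconds (system.time() order).
timings <- data.frame(
  blas    = c("R's blas", "ATLAS", "ATLAS multithread",
              "Goto 1 thread", "Goto multithread", "ACML"),
  user    = c(34.83, 36.70, 76.85, 36.17, 178.06, 49.69),
  system  = c(3.45, 3.28, 5.39, 3.44, 345.97, 3.36),
  elapsed = c(38.56, 40.14, 82.29, 39.76, 467.99, 53.23),
  stringsAsFactors = FALSE
)

# Elapsed time relative to R's reference BLAS (first row).
timings$elapsed_ratio <- round(timings$elapsed / timings$elapsed[1], 2)

# Share of CPU time spent in system calls; a large share is the
# symptom described in the thread as pthreads not working right.
timings$sys_share <- round(timings$system / (timings$user + timings$system), 2)

print(timings[order(timings$elapsed), ])
```

On these numbers only the multithreaded Goto run spends more CPU time in system calls than in user code, which is the pattern the thread attributes to threading overhead rather than to the BLAS kernels themselves.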

Thanks for the information on the timings.  It happens that this
message ended up in my R-help folder and I only got around to reading
that folder today.

Is it ok with you if I forward this message to Simon Urbanek?  I am
having similar difficulties in the timing with R on a dual-core Intel
MacBook.

>
>
> On Fri, 3 Mar 2006, Prof Brian Ripley wrote:
>
> > On Fri, 3 Mar 2006, Douglas Bates wrote:
> >
> >> I have been timing a particular model fit using lmer on several
> >> different computers and came up with a peculiar result - the model fit
> >> is considerably slower on a dual-core Athlon 64 using Goto's
> >> multithreaded BLAS than on a single-core processor.
> >
> > Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
> > (unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
> > for tuned BLAS, and (according to the AMD site I just looked up) the X2
> > 3800+ only has a 512Kb per core L2 cache.  Opterons have a 1Mb L2 cache.
> >
> > Also, the very large system time seen in the dual-core run is typical of
> > what I see when pthreads is not working right, and I suggest you try a
> > limit of one thread (see the R-admin manual).  On our dual-processor
> > Opteron 248 that ran in 44 secs instead of 328.
> >
> >> Here is the timing on a single-core Athlon 64 3000+ running under
> >> today's R-devel with version 0.995-5 of the Matrix package.
> >>
> >> > library(Matrix)
> >> > data(star, package = 'mlmRev')
> >> > system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch),
> >>                           star, control = list(nit=0,grad=0,msV=1)))
> >> [1] 43.10  3.78 48.41  0.00  0.00
> >>
> >>
> >> (If you run the timing yourself and don't want to see the iteration
> >> output, take the msV=1 out of the control list.  I keep it in there so
> >> I can monitor the progress.)
> >>
> >> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> >> the same version of R, BLAS and Matrix package, the timing ends up
> >> with something like
> >>
> >> 90 140 235 0 0
> > ....
> >
> >
>
> --
> Brian D. Ripley,                  [EMAIL PROTECTED]
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel