Here is a summary of some results on a dual Opteron 252 running FC3
(times in seconds):

                        user  system elapsed
64-bit gcc 3.4.5
  R's blas             34.83    3.45   38.56
  ATLAS                36.70    3.28   40.14
  ATLAS multithread    76.85    5.39   82.29
  Goto 1 thread        36.17    3.44   39.76
  Goto multithread    178.06  345.97  467.99
  ACML                 49.69    3.36   53.23

64-bit gcc 4.1.0
  R's blas             34.98    3.49   38.55

32-bit gcc 3.4.5
  R's blas             33.72    3.27   36.99

32-bit gcc 4.1.0
  R's blas             34.62    3.25   37.93

The timings are not that repeatable, but the message seems clear that
this problem does not benefit from a tuned BLAS and the overhead from
multiple threads is harmful.  (The gcc 4.1.0 results took fewer
iterations, which skews the results in its favour.)

And my 2GHz Pentium M laptop under Windows gave 39.96 3.68 44.06.

Clearly the Goto BLAS has a problem here: the results are slower on a
dual 252 than a dual 248 (see below).


On Fri, 3 Mar 2006, Prof Brian Ripley wrote:

> On Fri, 3 Mar 2006, Douglas Bates wrote:
>
>> I have been timing a particular model fit using lmer on several
>> different computers and came up with a peculiar result - the model fit
>> is considerably slower on a dual-core Athlon 64 using Goto's
>> multithreaded BLAS than on a single-core processor.
>
> Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
> (unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
> for tuned BLAS, and (according to the AMD site I just looked up) the X2
> 3800+ only has a 512Kb per core L2 cache.  Opterons have a 1Mb L2 cache.
>
> Also, the very large system time seen in the dual-core run is typical of
> what I see when pthreads is not working right, and I suggest you try a
> limit of one thread (see the R-admin manual).  On our dual-processor
> Opteron 248 that ran in 44 secs instead of 328.
>
>> Here is the timing on a single-core Athlon 64 3000+ running under
>> today's R-devel with version 0.995-5 of the Matrix package.
>>
>>> library(Matrix)
>>> data(star, package = 'mlmRev')
>>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch),
>>>                         star, control = list(nit=0,grad=0,msV=1)))
>> [1] 43.10  3.78 48.41  0.00  0.00
>>
>>
>> (If you run the timing yourself and don't want to see the iteration
>> output, take the msV=1 out of the control list.
>> I keep it in there so I can monitor the progress.)
>>
>> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
>> the same version of R, BLAS and Matrix package, the timing ends up
>> with something like
>>
>> 90 140 235 0 0
> ....
>
>

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
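
For anyone wanting to try the one-thread limit suggested in the quoted reply, here is a minimal sketch. It assumes a Goto BLAS build that reads the GOTO_NUM_THREADS or OMP_NUM_THREADS environment variables; the R-admin manual remains the authority on what a particular build honours. The file name timing.R is hypothetical and stands for a script containing the system.time(lmer(...)) call from the thread.

```shell
# Limit a threaded BLAS to a single thread for this shell session.
# Assumption: the BLAS in use reads one of these variables at start-up,
# so they must be set before R is launched, not from inside R.
export GOTO_NUM_THREADS=1
export OMP_NUM_THREADS=1

# Re-run the timing only if Rscript and the (hypothetical) script exist.
if command -v Rscript >/dev/null 2>&1 && [ -f timing.R ]; then
    Rscript timing.R
fi
```

Comparing the system time of this run with the unrestricted run should show whether the thread overhead is responsible, which appears to be the case for the "Goto 1 thread" and "Goto multithread" rows above.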