I've looked at the code for matrix multiplies in R, and found that it can be sped up quite a bit, for multiplies that are really vector dot products, and for other multiplies in which the result matrix is small.
Here's my test program:

    u <- seq(0,1,length=1000)
    v <- seq(0,2,length=1000)
    A2 <- matrix(2.1,2,1000)
    A5 <- matrix(2.1,5,1000)
    B3 <- matrix(3.2,1000,3)
    A4 <- matrix(2.1,4,1000)
    B4 <- matrix(3.2,1000,4)

    print(system.time ({for (i in 1:100000) w <- u %*% v}))
    print(system.time ({for (i in 1:100000) w <- A5 %*% v}))
    print(system.time ({for (i in 1:100000) R <- A2 %*% B3}))
    print(system.time ({for (i in 1:100000) R <- A5 %*% B3}))
    print(system.time ({for (i in 1:100000) R <- A4 %*% B4}))

Here are the five system.time results above using R 2.11.1:

       user  system elapsed
      1.680   0.000   1.683
       user  system elapsed
      4.350   0.000   4.349
       user  system elapsed
      4.820   0.000   4.818
       user  system elapsed
      7.370   0.020   7.383
       user  system elapsed
      7.990   0.000   7.991

and here are the results with my modified version of R 2.11.1:

       user  system elapsed
      0.270   0.000   0.271
       user  system elapsed
      1.290   0.000   1.287
       user  system elapsed
      1.080   0.000   1.086
       user  system elapsed
      3.640   0.000   3.642
       user  system elapsed
      7.950   0.000   7.953

The main issue here is again ISNAN. The R code doesn't trust the Fortran routine it calls to handle NA and NaN correctly, so it checks whether the arguments have any NAs or NaNs, and if so does the matrix multiply itself, with simple nested loops. For multiplies in which the number of elements in the result matrix is large, this check will take a negligible amount of time. But otherwise it can drastically slow things down. This is true in particular for dot products, of a 1xN times an Nx1 matrix, giving a 1x1 result.

I changed the code to handle dot products specially, with no need for ISNAN checks, and otherwise to go straight to the nested-loop code (bypassing Fortran and the ISNAN checks) if the result matrix has no more than 15 elements. One can see that the times are much improved above, except of course for the multiply that produces a 4x4 result, which takes the same amount of time as before, since nothing is different for that.
From the results, it seems that a threshold a bit above 15 might be even better, though this will of course depend on the architecture, compiler, etc.

Simon Urbanek points out in reply to my message about speeding up "sum" and "prod" that the effects of changes like this depend on the machine architecture, compiler, etc. This is of course true. The magnitude of variation in his results (not the absolute times, but the relative times of different versions) is rather larger than one would expect, though. Perhaps his time measurements were affected by random factors like system load, or perhaps they are due to non-random but essentially arbitrary factors like the exact alignment of code generated by the C compiler, which could change with even unrelated modifications to the program.

The changes that I made seem generally desirable for any architecture, or will at least make things no worse (the code expansion is trivial), apart from the possibility of such arbitrary factors. I'm not doing anything like manually unrolling loops, or replacing array indexing with pointer arithmetic, which might be seen as undesirable attempts at out-smarting the optimizer.

The times above are for an Intel Linux machine with the gcc version below:

    Using built-in specs.
    Target: i486-linux-gnu
    Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
    Thread model: posix
    gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)

I didn't change any R configuration options regarding compilation. Below is my modified version of the matprod function in src/main/array.c.
Radford Neal

------------------------------------------------------------------------

static void matprod(double *x, int nrx, int ncx,
                    double *y, int nry, int ncy, double *z)
{
    char *transa = "N", *transb = "N";
    int i, j, k;
    double one = 1.0, zero = 0.0;
    LDOUBLE sum;
    Rboolean do_it_here;

    /* NOTE: ncx must be equal to nry. */

    if (nrx==1 && ncy==1) {

        /* Do dot product quickly. */

        sum = 0.0;
        for (i = 0; i<ncx; i++)
            sum += x[i] * y[i];
        z[0] = sum;

    } else if (nrx > 0 && ncx > 0 && nry > 0 && ncy > 0) {

        /* Faster to just do it here if the result is small, especially
           considering the NA/NaN check below. */

        do_it_here = nrx*ncy <= 15;

        /* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
         * The test is only O(n) here
         */

        if (!do_it_here)
            for (i = 0; i < nrx*ncx; i++)
                if (ISNAN(x[i])) { do_it_here = TRUE; break; }

        if (!do_it_here)
            for (i = 0; i < nry*ncy; i++)
                if (ISNAN(y[i])) { do_it_here = TRUE; break; }

        if (do_it_here) {
            for (i = 0; i < nrx; i++)
                for (k = 0; k < ncy; k++) {
                    sum = 0.0;
                    for (j = 0; j < ncx; j++)
                        sum += x[i + j * nrx] * y[j + k * nry];
                    z[i + k * nrx] = sum;
                }
        } else
            F77_CALL(dgemm)(transa, transb, &nrx, &ncy, &ncx, &one,
                            x, &nrx, y, &nry, &zero, z, &nrx);

    } else /* zero-extent operations should return zeroes */
        for(i = 0; i < nrx*ncy; i++) z[i] = 0;
}

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel