On Sep 4, 2012, at 8:07 AM, "Matthew Dowle" <mdo...@mdowle.plus.com> wrote:
> >> Actually, my apologies, I was assuming that your example was based on the >> SO question while it is not at all (the code is not involved in that test >> case). Reversing the order does indeed cause a delay. Switching to a >> single index doesn't seem to have any impact. R-devel has the faster >> version now (which now also works with large vectors). >> >> Cheers, >> Simon > > I was intrigued why the compiler doesn't swap the loops when you thought > it should, though. You're not usually wrong! From GCC's documentation (end > of last paragraph is the most significant) : > > ==== > > -floop-interchange > Perform loop interchange transformations on loops. Interchanging two > nested loops switches the inner and outer loops. For example, given a loop > like: > DO J = 1, M > DO I = 1, N > A(J, I) = A(J, I) * C > ENDDO > ENDDO > > loop interchange transforms the loop as if it were written: > > DO I = 1, N > DO J = 1, M > A(J, I) = A(J, I) * C > ENDDO > ENDDO > > which can be beneficial when N is larger than the caches, because in > Fortran, the elements of an array are stored in memory contiguously by > column, and the original loop iterates over rows, potentially creating at > each access a cache miss. This optimization applies to all the languages > supported by GCC and is not limited to Fortran. To use this code > transformation, GCC has to be configured with --with-ppl and --with-cloog > to enable the Graphite loop transformation infrastructure. > > ==== > > Could R build scripts be configured to set these gcc flags to turn on > "Graphite", then? I guess one downside could be the time to compile. > The is something odd happening - when I use stand-alone code it works: $ gcc -o t2 -O3 t2.c $ time ./t2 0x7fbae3b4f010 real 0m1.045s user 0m0.784s sys 0m0.260s $ gcc -o t2 -floop-interchange -O3 t2.c $ time ./t2 0x7f4e516f2010 real 0m0.418s user 0m0.044s sys 0m0.372s However, when I split off the loop into a parametrized function it doesn't: $ gcc -floop-interchange -O3 t.c tt.c -o t && ./t 0x7fdd37cca010 loop time = 1772.085ms $ gcc -O3 t.c tt.c -o t && ./t 0x7f3aa8777010 loop time = 1763.888ms For comparison , manually swapping i and j: $ gcc -floop-interchange -O3 t.c tt.c -o t && ./t 0x7feecd4c9010 loop time = 451.744ms For the same reason, it doesn't work for the old R code. I wonder what's happening there - I guess the optimizer is not smart enough to realize the coverage is the entire m*n span despite the fact that m and n are parameters ... But it's certainly something it should optimize as it did when used directly in a function... Odd ... Cheers, Simon -- PS: Note this has nothing to do with "R builds scripts" - it is user's responsibility to add optimization flags since they are very much compiler and architecture-dependent. Also optimizations are occasionally known to backfire, so the default is always more conservative. FWIW this is my latest incarnation of configuring optimized R on E5 machines (minus BLAS and prefix flags which will vary by system): '--enable-lto' '--enable-R-shlib' 'CFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'CXXFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FCFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' -- test code (test compiled with gcc (Ubuntu/Linaro 4.7.1-7ubuntu1) 4.7.1 20120814 (prerelease)) t.c: void foo(int *x, const unsigned int m, const unsigned int n) { int i, j; for (i = 0; i < m; i++) for(j = 0; j < n; j++) x[i + j * m] = 1; } tt.c: #include <sys/time.h> #include <stdlib.h> #include <stdio.h> void foo(int *, int, int); static double ts() { struct timeval tv; gettimeofday(&tv, 0); return (double)tv.tv_sec + ((double) tv.tv_usec) / 1000000.0; } int main() { int m = 10000, n = 10000; int *x = (int*) malloc(m * n * sizeof(int)); double a = ts(), b; foo(x, m , n); b = ts(); printf("%p\nloop time = %.3fms\n", x, (b - a) * 1000); return 0; } t2.c: #include <stdio.h> #include <stdlib.h> int *foo() { int m = 10000, n = 10000; int *x = (int*) malloc(m * n * sizeof(int)); int i, j; for (i = 0; i < m; i++) for(j = 0; j < n; j++) x[i + j * m] = 1; return x; } int main() { printf("%p", foo()); return 0; } > Matthew > > >> >> On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote: >> >>> On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote: >>> >>>> >>>> In do_matrix in src/array.c there is a type switch containing : >>>> >>>> case LGLSXP : >>>> for (i = 0; i < nr; i++) >>>> for (j = 0; j < nc; j++) >>>> LOGICAL(ans)[i + j * NR] = NA_LOGICAL; >>>> >>>> That seems page inefficient, iiuc. Think it should be : >>>> >>>> case LGLSXP : >>>> for (j = 0; j < nc; j++) >>>> for (i = 0; i < nr; i++) >>>> LOGICAL(ans)[i + j * NR] = NA_LOGICAL; >>>> >>>> or more simply : >>>> >>>> case LGLSXP : >>>> for (i = 0; i < nc*nr; i++) >>>> LOGICAL(ans)[i] = NA_LOGICAL; >>>> >>>> ( with some fine tuning required since NR is type R_xlen_t whilst i, nc >>>> and nr are type int ). >>>> >>>> Same goes for all the other types in that switch. >>>> >>>> This came up on Stack Overflow here : >>>> http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r >>>> >>> >>> That is completely irrelevant - modern compilers will optimize the loops >>> accordingly and there is no difference in speed. If you don't believe >>> it, run benchmarks ;) >>> >>> original >>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10) >>> Unit: milliseconds >>> expr min lq median uq >>> max >>> 1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196 >>> 1409.901 >>> >>> >>> swapped >>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10) >>> Unit: milliseconds >>> expr min lq median uq >>> max >>> 1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001 >>> 1246.573 >>> >>> Cheers, >>> Simon >>> >>> >>>> Matthew >>>> >>>> ______________________________________________ >>>> R-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>>> >>> >>> ______________________________________________ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >>> >> >> > > > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel