I am using R to multiply some large (30k x 30k double) matrices on a 64 core
machine (xeon phi). I added some timers to src/main/array.c to see where the
time is going. All of the time is being spent in the matprod function, most of
that time is spent in dgemm. 15 seconds is in matprod in some code that is
checking if there are NaNs.
> system.time (C <- B %*% A)
nancheck: wall time 15.240282s
dgemm: wall time 43.111064s
matprod: wall time 58.351572s
user system elapsed
2710.154 20.999 58.398
The NaN checking code is not being vectorized because of the early exit when
NaN is detected:
/* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
* The test is only O(n) here.
*/
for (R_xlen_t i = 0; i < NRX*ncx; i++)
if (ISNAN(x[i])) {have_na = TRUE; break;}
if (!have_na)
for (R_xlen_t i = 0; i < NRY*ncy; i++)
if (ISNAN(y[i])) {have_na = TRUE; break;}
I tried deleting the 'break'. By inspecting the asm code, I verified that the
loop was not being vectorized before, but now is vectorized. Total time goes
down:
system.time (C <- B %*% A)
nancheck: wall time 1.898667s
dgemm: wall time 43.913621s
matprod: wall time 45.812468s
user system elapsed
2727.877 20.723 45.859
The break accelerates the case when there is a NaN, at the expense of the much
more common case when there isn't a NaN. If a NaN is detected, it doesn't call
dgemm and calls its own matrix multiply, which makes the NaN check time
insignificant so I doubt the early exit provides any benefit.
I was a little surprised that the O(n) NaN check is costly compared to the
O(n**2) dgemm that follows. I think the reason is that nan check is single
thread and not vectorized, and my machine can do 2048 floating point ops/cycle
when you consider the cores/dual issue/8 way SIMD/muladd, and the constant
factor will be significant for even large matrices.
Would you consider deleting the breaks? I can submit a patch if that will help.
Thanks.
Robert
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel