Hi Jose,

here's the benchmark data obtained on my local machine running an NVIDIA GTX 285 for vectors of size 100k:
# Master
./ex43 -n 100000 -k 200 -mdot -log_summary                 VecMDot  5.6363e+01
./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp  VecMDot  2.1936e+01
./ex43 -n 100000 -k 200 -log_summary                       VecDot   5.1124e+01
./ex43 -n 100000 -k 200 -log_summary -vec_type cusp        VecDot   4.0968e+01

# Next
./ex43 -n 100000 -k 200 -mdot -log_summary                 VecMDot  5.6417e+01
./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp  VecMDot  1.0281e+01
./ex43 -n 100000 -k 200 -log_summary                       VecDot   5.0886e+01
./ex43 -n 100000 -k 200 -log_summary -vec_type cusp        VecDot   4.1905e+01

That gives about 10 sec in next vs. 20 sec on master for the CUDA-accelerated mdot case. The factor of two is actually as expected: the 'old' path applies one dot kernel per vector and thus reads x once per inner product, so its data movement is roughly twice that of the custom fused kernel, which reads x only once (a small illustrative kernel sketch is appended after the quoted results below). The factor of four with respect to VecDot is not entirely clear to me; I'd rather expect a factor close to two. Presumably the more frequent host <-> device transfers add extra overhead.

Best regards,
Karli

On 03/26/2013 10:39 AM, Jose E. Roman wrote:
>
> On 26/03/2013, at 02:41, Karl Rupp wrote:
>
>> Hi Jose, Paul, and others,
>>
>> I worked today on VecMDot and came up with an implementation which is
>> faster than an iterated application of the standard cusp::blas::dot()
>> (which, if I'm not mistaken, just forwards to CUBLAS) if enough vectors
>> (>~6) are involved. For complex arithmetic, an iterated application of
>> cusp::blas::dotc() is used, since passing complex types to CUDA kernels
>> is fairly tricky within PETSc. Jose, any performance feedback from
>> within SLEPc is appreciated :-)
>>
>> The new implementation is based on custom kernels, only allocates a
>> little scratchpad memory, and is thus more memory efficient than the
>> old version. Also, any unnecessary copying of data is avoided. This
>> should speed up GMRES quite a bit, yet I haven't run any dedicated
>> GMRES benchmarks. Paul, I guess you have some samples at hand, don't
>> you?
>>
>> Best regards,
>> Karli
>
> In my tests, the new implementation is actually slower. I tried
> src/vec/vec/examples/tests/ex43.c with 200 vectors of length 10000.
> The time increases from 4.1 to 7.2 seconds. Can anyone try to repeat
> the tests below?
>
> I have an Intel Core i7 with two Tesla C2050s.
>
> Jose
>
>
> master
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
> VecMDot      3980 1.0 3.6485e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 11100  0  0  0 11100  0  0  0  2182
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot      3980 1.0 4.1368e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40100  0  0  0 40100  0  0  0  1924
>
> $ ./ex43 -n 10000 -k 200 -log_summary
> VecDot     398000 1.0 2.1585e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 78100  0  0  0 78100  0  0  0   369
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
> VecDot     398000 1.0 2.9228e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100  0  0  0 82100  0  0  0   272
>
>
> next
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
> VecMDot      3980 1.0 3.6899e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 39100  0  0  0 39100  0  0  0  2157
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot      3980 1.0 7.1823e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 54100  0  0  0 54100  0  0  0  1108
>
> $ ./ex43 -n 10000 -k 200 -log_summary
> VecDot     398000 1.0 2.1702e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 79100  0  0  0 79100  0  0  0   367
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
> VecDot     398000 1.0 2.8953e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100  0  0  0 82100  0  0  0   275
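
PS: For illustration only, here is a minimal, self-contained sketch of the kind of fused kernel I have in mind. This is not the code in next; the kernel name (mdot4_kernel), the block/thread counts, and the little host driver are all made up for this example. The point is simply that each x[i] is read from global memory once and reused for all four inner products of a batch, which is where the roughly halved data movement comes from:

// mdot_sketch.cu -- illustrative only, not the PETSc implementation.
// Fused dot products of x with a batch of four vectors y0..y3.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256   // power of two, required by the tree reduction below
#define BLOCKS  128

__global__ void mdot4_kernel(const double *x,
                             const double *y0, const double *y1,
                             const double *y2, const double *y3,
                             int n, double *partial /* 4*BLOCKS entries */)
{
  __shared__ double red[4][THREADS];
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

  // Grid-stride loop: x[i] is loaded once and reused for all four products.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    double xi = x[i];
    s0 += xi * y0[i];
    s1 += xi * y1[i];
    s2 += xi * y2[i];
    s3 += xi * y3[i];
  }

  red[0][threadIdx.x] = s0;
  red[1][threadIdx.x] = s1;
  red[2][threadIdx.x] = s2;
  red[3][threadIdx.x] = s3;
  __syncthreads();

  // Standard tree reduction within the block.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      for (int j = 0; j < 4; ++j)
        red[j][threadIdx.x] += red[j][threadIdx.x + stride];
    __syncthreads();
  }

  // One partial result per (vector, block); the final sum is a cheap
  // second reduction over BLOCKS values.
  if (threadIdx.x == 0)
    for (int j = 0; j < 4; ++j)
      partial[j * gridDim.x + blockIdx.x] = red[j][0];
}

int main(void)
{
  const int n = 100000;
  std::vector<double> hx(n, 1.0), hy(n, 2.0), hp(4 * BLOCKS);
  double *x, *y, *partial;
  cudaMalloc(&x, n * sizeof(double));
  cudaMalloc(&y, n * sizeof(double));
  cudaMalloc(&partial, 4 * BLOCKS * sizeof(double));
  cudaMemcpy(x, hx.data(), n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(y, hy.data(), n * sizeof(double), cudaMemcpyHostToDevice);

  // Reuse the same y for all four slots just to exercise the kernel.
  mdot4_kernel<<<BLOCKS, THREADS>>>(x, y, y, y, y, n, partial);

  cudaMemcpy(hp.data(), partial, hp.size() * sizeof(double), cudaMemcpyDeviceToHost);
  double dot0 = 0.0;
  for (int b = 0; b < BLOCKS; ++b) dot0 += hp[b];   // finish the first dot on the host
  printf("dot(x, y0) = %g (expected %g)\n", dot0, 2.0 * n);

  cudaFree(x); cudaFree(y); cudaFree(partial);
  return 0;
}

The small partial buffer plays the role of the scratchpad mentioned above; for more than four vectors one would call such a kernel batch-wise, and the per-vector sums are finished either on the host (as here) or with a second small reduction kernel.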
