Hi Paul, you're very welcome. I'm glad it works out fine for your case as well.
Best regards,
Karli

On 04/07/2013 11:17 PM, Paul Mullowney wrote:
> VecMDot is performing great for my examples. About 2X faster than the
> trivial implementation that I originally suggested when I reported the
> problem.
> Thanks, Karl.
> -Paul
>> On 26/03/2013, at 20:15, Karl Rupp wrote:
>>
>>> Hi Jose,
>>>
>>> here is the benchmark data obtained on my local machine running an
>>> NVIDIA GTX 285 for vectors of size 100k:
>>>
>>> # Master
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>>> VecMDot 5.6363e+01
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>>> VecMDot 2.1936e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary
>>> VecDot 5.1124e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>>> VecDot 4.0968e+01
>>>
>>> # Next
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>>> VecMDot 5.6417e+01
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>>> VecMDot 1.0281e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary
>>> VecDot 5.0886e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>>> VecDot 4.1905e+01
>>>
>>> This makes 10 sec in 'next' vs. 20 sec in 'master' for the
>>> CUDA-accelerated mdot case. The factor of two is actually as
>>> expected, because the 'old' kernel moves twice as much data as the
>>> custom kernel version. The factor of four with respect to VecDot is
>>> not entirely clear to me; I'd rather expect a factor close to 2.
>>> Presumably the more frequent host <-> device transfers add extra
>>> overhead.
>>>
>>> Best regards,
>>> Karli
>> Here are my numbers for this size. They are similar to yours (a bit
>> worse, though). Also, I tried with ViennaCL, which gave very poor
>> performance (is this normal?).
>>
>> # Master
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot 4.0681e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot 2.4489e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot 5.9457e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot 5.0021e+01
>>
>> # Next
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot 4.4252e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot 1.2176e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot 5.9847e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot 5.0080e+01
>>
>> # ViennaCL
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type viennacl
>> VecMDot 9.4478e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type viennacl
>> VecMDot 1.2311e+02
>>
>> I tried a full SLEPc computation, with a matrix of order 256,000 and
>> making VecMDot operate on 40 vectors. The gain from 'master' to 'next'
>> is 91 seconds to 53 seconds. So yes, it is a good improvement. Thanks.
>> However, I still see only a modest speedup (about 4x) with respect to
>> the CPU (since we do some optimizations for the CPU). Also, performance
>> depends a lot on the different matrix dimensions. I have to figure out
>> how to optimize it more for the GPU as well.
>>
>> Jose
