On 26/03/2013, at 20:15, Karl Rupp wrote:

> Hi Jose,
> 
> here's the benchmark data obtained on my local machine running an NVIDIA GTX 
> 285 for vectors of size 100k:
> 
> # Master
> 
> ./ex43 -n 100000 -k 200 -mdot -log_summary
> VecMDot  5.6363e+01
> 
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot  2.1936e+01
> 
> ./ex43 -n 100000 -k 200 -log_summary
> VecDot   5.1124e+01
> 
> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
> VecDot   4.0968e+01
> 
> 
> # Next
> 
> ./ex43 -n 100000 -k 200 -mdot -log_summary
> VecMDot  5.6417e+01
> 
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot  1.0281e+01
> 
> ./ex43 -n 100000 -k 200 -log_summary
> VecDot   5.0886e+01
> 
> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
> VecDot   4.1905e+01
> 
> 
> This makes about 10 sec. in next vs. about 20 sec. on master for the 
> CUDA-accelerated mdot case. The factor of two is as expected, because the 
> 'old' kernel moves twice as much data as the custom kernel version. The 
> factor of four with respect to VecDot is not entirely clear to me; I'd 
> rather expect a factor close to 2. Presumably the more frequent host <-> 
> device transfers add extra overhead.
> 
> Best regards,
> Karli
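
The factor of two mentioned above can be illustrated with a rough count of memory traffic. This is only a sketch under an assumption that is not stated in the thread: that the 'old' path effectively performs one dot product per vector (reading x and y_i each time), while the custom kernel reads x once and each y_i once. The function names and the values of k are made up for illustration and are not part of PETSc or ex43.

```python
def separate_dots_traffic(n, k):
    """Vector entries read when k dot products (x, y_i) are done one at a time."""
    # Assumed model: each separate dot product reads x and one y_i (2*n entries).
    return 2 * n * k


def fused_mdot_traffic(n, k):
    """Vector entries read when one fused kernel handles all k dot products."""
    # Assumed model: x is read once, and each of the k vectors y_i is read once.
    return n * (k + 1)


n = 100_000  # vector length used in the benchmark runs
for k in (2, 8, 40):  # illustrative numbers of vectors
    ratio = separate_dots_traffic(n, k) / fused_mdot_traffic(n, k)
    print(f"k={k:3d}  traffic ratio (separate / fused) = {ratio:.2f}")
```

Under this model the ratio is 2k/(k+1), which approaches 2 as k grows. It ignores host <-> device transfers and kernel launch overhead, so it says nothing about the factor of four with respect to VecDot.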

Here are my numbers for this size. They are similar to yours (a bit worse, 
though). I also tried ViennaCL, which gave very poor performance (is this 
normal?).

# Master

./ex43 -n 100000 -k 200 -mdot -log_summary
VecMDot  4.0681e+01

./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
VecMDot  2.4489e+01

./ex43 -n 100000 -k 200 -log_summary
VecDot   5.9457e+01

./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
VecDot   5.0021e+01

# Next

./ex43 -n 100000 -k 200 -mdot -log_summary
VecMDot  4.4252e+01

./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
VecMDot  1.2176e+01

./ex43 -n 100000 -k 200 -log_summary
VecDot   5.9847e+01

./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
VecDot   5.0080e+01

# ViennaCL

./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type viennacl
VecMDot  9.4478e+01

./ex43 -n 100000 -k 200 -log_summary -vec_type viennacl
VecDot   1.2311e+02


I tried a full SLEPc computation with a matrix of order 256,000, with 
VecMDot operating on 40 vectors. Going from 'master' to 'next' reduces the 
time from 91 seconds to 53 seconds, so yes, it is a good improvement. Thanks. 
However, I still see only a modest speedup (about 4x) with respect to the CPU 
(since we do some optimizations for the CPU). Also, performance depends a lot 
on the matrix dimensions. I have to figure out how to optimize it further for 
the GPU as well.
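
As a quick check on the ratios discussed in this thread, here is a short Python sketch that computes them from the timings reported above. The numbers are copied from the -log_summary lines and from the SLEPc run; the dictionary keys and the `speedup` helper are names invented for this example.

```python
# Timings in seconds, copied from the runs reported in this thread.
karl = {"master_cusp_mdot": 21.936, "next_cusp_mdot": 10.281, "next_cusp_dot": 41.905}
jose = {"master_cusp_mdot": 24.489, "next_cusp_mdot": 12.176, "next_cpu_mdot": 44.252}
slepc = {"master": 91.0, "next": 53.0}


def speedup(before, after):
    """Return how many times faster 'after' is compared with 'before'."""
    return before / after


for name, t in (("Karl", karl), ("Jose", jose)):
    gain = speedup(t["master_cusp_mdot"], t["next_cusp_mdot"])
    print(f"{name}: cusp VecMDot, master -> next: {gain:.2f}x")

print(f"Karl: next, cusp VecDot vs cusp VecMDot: "
      f"{speedup(karl['next_cusp_dot'], karl['next_cusp_mdot']):.2f}x")
print(f"Jose: next, CPU VecMDot vs cusp VecMDot: "
      f"{speedup(jose['next_cpu_mdot'], jose['next_cusp_mdot']):.2f}x")
print(f"SLEPc run, master -> next: {speedup(slepc['master'], slepc['next']):.2f}x")
```

The master -> next gain for the cusp VecMDot comes out close to 2 on both machines, and the full SLEPc run gains roughly 1.7x, which is smaller because VecMDot is only one part of that computation.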

Jose
