VecMDot is performing great for my examples. About 2X faster than the 
trivial implementation that I originally suggested when I reported the 
problem.
Thanks Karl.
-Paul
> On 26/03/2013, at 20:15, Karl Rupp wrote:
>
>> Hi Jose,
>>
>> here's the benchmark data obtained on my local machine running an NVIDIA GTX 
>> 285 for vectors of size 100k:
>>
>> # Master
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot  5.6363e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot  2.1936e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot   5.1124e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot   4.0968e+01
>>
>>
>> # Next
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot  5.6417e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot  1.0281e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot   5.0886e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot   4.1905e+01
>>
>>
>> This makes 10 sec on next vs. 20 sec on master for the CUDA-accelerated 
>> mdot case. The factor of two is as expected, because the 'old' kernel 
>> moves twice as much data as the custom kernel version. The factor of 
>> four with respect to VecDot is not entirely clear to me; I'd rather 
>> expect a factor close to 2. Presumably the more frequent host<->device 
>> transfers add extra overhead.
>>
>> Best regards,
>> Karli
> Here are my numbers for this size. They are similar to yours (a bit worse, 
> though). Also, I tried ViennaCL, which gave very poor performance (is 
> this normal?).
>
> # Master
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary
> VecMDot  4.0681e+01
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot  2.4489e+01
>
> ./ex43 -n 100000 -k 200 -log_summary
> VecDot   5.9457e+01
>
> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
> VecDot   5.0021e+01
>
> # Next
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary
> VecMDot  4.4252e+01
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot  1.2176e+01
>
> ./ex43 -n 100000 -k 200 -log_summary
> VecDot   5.9847e+01
>
> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
> VecDot   5.0080e+01
>
> # ViennaCL
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type viennacl
> VecMDot 9.4478e+01
>
> ./ex43 -n 100000 -k 200 -log_summary -vec_type viennacl
> VecDot  1.2311e+02
>
>
> I tried a full SLEPc computation with a matrix of order 256,000, making 
> VecMDot operate on 40 vectors. The runtime drops from 91 seconds on 
> 'master' to 53 seconds on 'next', so yes, it is a good improvement. 
> Thanks. However, I still see only a modest speedup (about 4x) with 
> respect to the CPU, since we do some optimizations for the CPU. Also, 
> performance depends a lot on the matrix dimensions. I have to figure out 
> how to optimize more for the GPU as well.
>
> Jose
>
