VecMDot is performing great for my examples. About 2X faster than the
trivial implementation that I originally suggested when I reported the
problem.
Thanks Karl.
-Paul
> On 26/03/2013, at 20:15, Karl Rupp wrote:
>
>> Hi Jose,
>>
>> here's the benchmark data obtained on my local machine running an NVIDIA GTX
>> 285 for vectors of size 100k:
>>
>> # Master
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot 5.6363e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot 2.1936e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot 5.1124e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot 4.0968e+01
>>
>>
>> # Next
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot 5.6417e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot 1.0281e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot 5.0886e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot 4.1905e+01
>>
>>
>> This gives about 10 sec. on next vs. about 20 sec. on master for the
>> CUDA-accelerated mdot case. The factor of two is as expected, because the
>> 'old' kernel moves twice as much data as the custom kernel version. The
>> factor of four with respect to VecDot is not entirely clear to me; I'd
>> rather expect a factor close to 2. Presumably the more frequent
>> host <-> device transfers add extra overhead.
>>
>> Best regards,
>> Karli
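
The factor-of-two argument above can be illustrated with a small load-counting
model. This is only a sketch under assumptions that the thread does not
confirm: that the 'old' path behaves like k independent dot products (re-reading
x for every y_j), and that the custom kernel reads each entry of x once and
reuses it for all k vectors. The function names are hypothetical, and the code
runs on the CPU in plain Python; it is not the CUDA kernel.

```python
def dots_separate(x, ys):
    """k independent dot products: x is read again for every vector in ys."""
    loads = 0
    results = []
    for y in ys:
        s = 0.0
        for xi, yi in zip(x, y):
            s += xi * yi
            loads += 2  # one entry of x plus one entry of y
        results.append(s)
    return results, loads


def dots_fused(x, ys):
    """Single pass over x: each x[i] is read once and reused for all of ys."""
    loads = 0
    results = [0.0] * len(ys)
    for i, xi in enumerate(x):
        loads += 1  # x[i] is read once
        for j, y in enumerate(ys):
            results[j] += xi * y[i]
            loads += 1  # one entry of y_j
    return results, loads
```

In this model the separate version counts 2*n*k loads and the fused version
n*(k+1). For k = 200 the ratio is 400/201, i.e. just under 2, which matches
the expectation stated above. It says nothing about host <-> device transfers.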
> Here are my numbers for this size. They are similar to yours (a bit worse,
> though). I also tried ViennaCL, which gave very poor performance (is
> this normal?).
>
> # Master
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary
> VecMDot 4.0681e+01
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot 2.4489e+01
>
> ./ex43 -n 100000 -k 200 -log_summary
> VecDot 5.9457e+01
>
> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
> VecDot 5.0021e+01
>
> # Next
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary
> VecMDot 4.4252e+01
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot 1.2176e+01
>
> ./ex43 -n 100000 -k 200 -log_summary
> VecDot 5.9847e+01
>
> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
> VecDot 5.0080e+01
>
> # ViennaCL
>
> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type viennacl
> VecMDot 9.4478e+01
>
> ./ex43 -n 100000 -k 200 -log_summary -vec_type viennacl
> VecMDot 1.2311e+02
>
>
> I tried a full SLEPc computation with a matrix of order 256,000, making
> VecMDot operate on 40 vectors. Going from 'master' to 'next', the time drops
> from 91 seconds to 53 seconds. So yes, it is a good improvement. Thanks.
> However, I still see only a modest speedup (about 4x) with respect to the
> CPU (since we do some optimizations for the CPU). Also, performance depends
> a lot on the matrix dimensions. I have to figure out how to optimize it
> further for the GPU as well.
>
> Jose
>
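
For reference, the master-to-next speedups quoted in this thread can be
recomputed from the reported times. The numbers below are copied from the
messages above (VecMDot with -vec_type cusp, plus the full SLEPc run); the
dictionary layout and the function name are only illustrative.

```python
# Times in seconds, as reported in the thread (master vs. next).
REPORTED = {
    "Karl, VecMDot cusp": (2.1936e+01, 1.0281e+01),
    "Jose, VecMDot cusp": (2.4489e+01, 1.2176e+01),
    "Jose, full SLEPc run": (91.0, 53.0),
}


def speedup(before, after):
    """Ratio of the old time to the new time (greater than 1 means faster)."""
    return before / after


def all_speedups(timings=REPORTED):
    """Return {label: speedup} for every (master, next) pair."""
    return {label: speedup(m, n) for label, (m, n) in timings.items()}


if __name__ == "__main__":
    for label, ratio in all_speedups().items():
        print(f"{label}: {ratio:.2f}x")
```

The ratios come out at roughly 2.1, 2.0 and 1.7 respectively, so the kernel
benchmark shows the expected factor of two while the full computation gains
somewhat less.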