Hi Jose,

here's the benchmark data obtained on my local machine running an NVIDIA GTX 285 for vectors of size 100k:
# Master
./ex43 -n 100000 -k 200 -mdot -log_summary                 VecMDot  5.6363e+01
./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp  VecMDot  2.1936e+01
./ex43 -n 100000 -k 200 -log_summary                       VecDot   5.1124e+01
./ex43 -n 100000 -k 200 -log_summary -vec_type cusp        VecDot   4.0968e+01

# Next
./ex43 -n 100000 -k 200 -mdot -log_summary                 VecMDot  5.6417e+01
./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp  VecMDot  1.0281e+01
./ex43 -n 100000 -k 200 -log_summary                       VecDot   5.0886e+01
./ex43 -n 100000 -k 200 -log_summary -vec_type cusp        VecDot   4.1905e+01

That gives about 10 sec in next vs. 20 sec on master for the CUDA-accelerated mdot case. The factor of two is actually as expected: the 'old' path applies one dot kernel per vector and thus reads x once per inner product, so its data movement is roughly twice that of the custom fused kernel, which reads x only once (a small illustrative kernel sketch is appended after the quoted results below). The factor of four with respect to VecDot is not entirely clear to me; I'd rather expect a factor close to two. Presumably the more frequent host <-> device transfers add extra overhead.

Best regards,
Karli

On 03/26/2013 10:39 AM, Jose E. Roman wrote:
>
> On 26/03/2013, at 02:41, Karl Rupp wrote:
>
>> Hi Jose, Paul, and others,
>>
>> I worked today on VecMDot and came up with an implementation which is
>> faster than an iterated application of the standard cusp::blas::dot()
>> (which, if I'm not mistaken, just forwards to CUBLAS) if enough vectors
>> (>~6) are involved. For complex arithmetic, an iterated application of
>> cusp::blas::dotc() is used, since passing complex types to CUDA kernels
>> is fairly tricky within PETSc. Jose, any performance feedback from
>> within SLEPc is appreciated :-)
>>
>> The new implementation is based on custom kernels, only allocates a
>> little scratchpad memory, and is thus more memory efficient than the
>> old version. Also, any unnecessary copying of data is avoided. This
>> should speed up GMRES quite a bit, yet I haven't run any dedicated
>> GMRES benchmarks. Paul, I guess you have some samples at hand, don't
>> you?
>>
>> Best regards,
>> Karli
>
> In my tests, the new implementation is actually slower. I tried
> src/vec/vec/examples/tests/ex43.c with 200 vectors of length 10000.
> The time increases from 4.1 to 7.2 seconds. Can anyone try to repeat
> the tests below?
>
> I have an Intel Core i7 with two Tesla C2050s.
>
> Jose
>
>
> master
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
> VecMDot      3980 1.0 3.6485e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 11100  0  0  0 11100  0  0  0  2182
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot      3980 1.0 4.1368e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40100  0  0  0 40100  0  0  0  1924
>
> $ ./ex43 -n 10000 -k 200 -log_summary
> VecDot     398000 1.0 2.1585e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 78100  0  0  0 78100  0  0  0   369
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
> VecDot     398000 1.0 2.9228e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100  0  0  0 82100  0  0  0   272
>
>
> next
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
> VecMDot      3980 1.0 3.6899e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 39100  0  0  0 39100  0  0  0  2157
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
> VecMDot      3980 1.0 7.1823e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 54100  0  0  0 54100  0  0  0  1108
>
> $ ./ex43 -n 10000 -k 200 -log_summary
> VecDot     398000 1.0 2.1702e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 79100  0  0  0 79100  0  0  0   367
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
> VecDot     398000 1.0 2.8953e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100  0  0  0 82100  0  0  0   275
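
PS: For illustration only, here is a minimal, self-contained sketch of the kind of fused kernel I have in mind. This is not the code in next; the kernel name (mdot4_kernel), the block/thread counts, and the little host driver are all made up for this example. The point is simply that each x[i] is read from global memory once and reused for all four inner products of a batch, which is where the roughly halved data movement comes from:

// mdot_sketch.cu -- illustrative only, not the PETSc implementation.
// Fused dot products of x with a batch of four vectors y0..y3.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256   // power of two, required by the tree reduction below
#define BLOCKS  128

__global__ void mdot4_kernel(const double *x,
                             const double *y0, const double *y1,
                             const double *y2, const double *y3,
                             int n, double *partial /* 4*BLOCKS entries */)
{
  __shared__ double red[4][THREADS];
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

  // Grid-stride loop: x[i] is loaded once and reused for all four products.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    double xi = x[i];
    s0 += xi * y0[i];
    s1 += xi * y1[i];
    s2 += xi * y2[i];
    s3 += xi * y3[i];
  }

  red[0][threadIdx.x] = s0;
  red[1][threadIdx.x] = s1;
  red[2][threadIdx.x] = s2;
  red[3][threadIdx.x] = s3;
  __syncthreads();

  // Standard tree reduction within the block.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      for (int j = 0; j < 4; ++j)
        red[j][threadIdx.x] += red[j][threadIdx.x + stride];
    __syncthreads();
  }

  // One partial result per (vector, block); the final sum is a cheap
  // second reduction over BLOCKS values.
  if (threadIdx.x == 0)
    for (int j = 0; j < 4; ++j)
      partial[j * gridDim.x + blockIdx.x] = red[j][0];
}

int main(void)
{
  const int n = 100000;
  std::vector<double> hx(n, 1.0), hy(n, 2.0), hp(4 * BLOCKS);
  double *x, *y, *partial;
  cudaMalloc(&x, n * sizeof(double));
  cudaMalloc(&y, n * sizeof(double));
  cudaMalloc(&partial, 4 * BLOCKS * sizeof(double));
  cudaMemcpy(x, hx.data(), n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(y, hy.data(), n * sizeof(double), cudaMemcpyHostToDevice);

  // Reuse the same y for all four slots just to exercise the kernel.
  mdot4_kernel<<<BLOCKS, THREADS>>>(x, y, y, y, y, n, partial);

  cudaMemcpy(hp.data(), partial, hp.size() * sizeof(double), cudaMemcpyDeviceToHost);
  double dot0 = 0.0;
  for (int b = 0; b < BLOCKS; ++b) dot0 += hp[b];   // finish the first dot on the host
  printf("dot(x, y0) = %g (expected %g)\n", dot0, 2.0 * n);

  cudaFree(x); cudaFree(y); cudaFree(partial);
  return 0;
}

The small partial buffer plays the role of the scratchpad mentioned above; for more than four vectors one would call such a kernel batch-wise, and the per-vector sums are finished either on the host (as here) or with a second small reduction kernel.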
