Hi Jose, Paul, and others,

I worked on VecMDot today and came up with an implementation that is faster than an iterated application of the standard cusp::blas::dot() (which, if I'm not mistaken, just forwards to CUBLAS) whenever enough vectors (more than about six) are involved. For complex arithmetic, an iterated application of cusp::blas::dotc() is still used, since passing complex types to CUDA kernels is fairly tricky within PETSc. Jose, any performance feedback from within SLEPc is appreciated :-)
The new implementation is based on custom kernels, allocates only a small scratchpad, and is thus more memory-efficient than the old version. Any unnecessary copying of data is also avoided. This should speed up GMRES quite a bit, but I haven't run any dedicated GMRES benchmarks yet. Paul, I guess you have some samples at hand, don't you?

Best regards,
Karli
