Forgot to mention: Code is in 'next': https://bitbucket.org/petsc/petsc/commits/78e6257bdd411e017b354225da1226dab51c07b7
On 03/25/2013 08:41 PM, Karl Rupp wrote:
> Hi Jose, Paul, and others,
>
> I worked on VecMDot today and came up with an implementation which is
> faster than an iterated application of the standard cusp::blas::dot()
> (which, if I'm not mistaken, just forwards to CUBLAS) if enough vectors
> (>~6) are involved. For complex arithmetic, an iterated application of
> cusp::blas::dotc() is used, since passing complex types to CUDA kernels
> is fairly tricky within PETSc. Jose, any performance feedback from
> within SLEPc is appreciated :-)
>
> The new implementation is based on custom kernels, only allocates a
> little scratchpad memory and is thus more memory efficient than the old
> version. Also, any unnecessary copying of data is avoided. This should
> speed up GMRES quite a bit, yet I haven't run any dedicated GMRES
> benchmarks. Paul, I guess you have some samples at hand, don't you?
>
> Best regards,
> Karli
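
For readers who want to see why a dedicated multi-dot can beat an iterated dot(), here is a minimal CPU sketch in plain C. It is not the CUDA kernel code from the commit, and the function names mdot_iterated and mdot_fused are hypothetical, made up for illustration. The assumption is that the gain comes from reading x only once and reusing each entry for all vectors, instead of streaming x through memory once per vector.

```c
#include <assert.h>
#include <stddef.h>

/* Plain dot product: one full pass over x and y. */
static double dot(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

/* Baseline (hypothetical name): nv separate dot products,
 * so x is read from memory nv times. */
void mdot_iterated(size_t n, const double *x, size_t nv,
                   const double *const *y, double *result)
{
    for (size_t j = 0; j < nv; ++j) result[j] = dot(n, x, y[j]);
}

/* Fused variant (hypothetical name): a single pass over x.
 * Each x[i] is loaded once and reused for all nv vectors. */
void mdot_fused(size_t n, const double *x, size_t nv,
                const double *const *y, double *result)
{
    assert(x != NULL && y != NULL && result != NULL);
    for (size_t j = 0; j < nv; ++j) result[j] = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double xi = x[i];
        for (size_t j = 0; j < nv; ++j) result[j] += xi * y[j][i];
    }
}
```

Both functions compute the same nv results; only the memory access pattern differs. Whether the fused form pays off on a GPU depends on the number of vectors and on how the reduction is organized, so treat this as a sketch of the idea rather than a performance claim.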
