Hi Jose, Paul, and others,

I worked on VecMDot today and came up with an implementation that is faster than an iterated application of the standard cusp::blas::dot() (which, if I'm not mistaken, just forwards to CUBLAS) whenever enough vectors (more than about six) are involved. For complex arithmetic, an iterated application of cusp::blas::dotc() is still used, since passing complex types to CUDA kernels is fairly tricky within PETSc. Jose, any performance feedback from within SLEPc is appreciated :-)
The new implementation is based on custom kernels, allocates only a small scratchpad, and is thus more memory-efficient than the old version. Any unnecessary copying of data is also avoided. This should speed up GMRES quite a bit, but I haven't run any dedicated GMRES benchmarks yet. Paul, I guess you have some samples at hand, don't you?

Best regards,
Karli
