VecMDot_SeqCUSP seems to perform rather poorly. This has a large impact on SLEPc, because it is the main kernel used in the orthogonalization of vectors.
Could this be due to the version of Thrust? I am using CUDA Toolkit 4.0.

I tried a naive replacement that copies the contents of the vectors into a matrix and calls CUBLAS dgemv. The improvement is significant, despite the overhead of the data movement. In some tests, the time reported for VecReduceArith drops from 24.5 seconds to 9.6 seconds (with up to 200 vectors of length 10000) on a Fermi GPU. I can send you the code to try.

Jose
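
The idea behind the replacement can be sketched on the CPU in plain C. This is only an illustration under stated assumptions, not the code used in the tests above: it assumes real scalars, all function names (dot, mdot_separate, pack_columns, gemv_transposed, mdot_demo) are hypothetical, and a simple loop stands in for the CUBLAS dgemv call. It shows that k separate dot products against the same vector x give the same result as packing the k vectors into the columns of a matrix A and computing y = A^T x once.

```c
#include <assert.h>
#include <stddef.h>

/* Plain dot product of two length-n vectors. */
double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Baseline: k independent dot products, y[j] = v[j] . x. */
void mdot_separate(size_t n, size_t k, const double *const *v,
                   const double *x, double *y)
{
    for (size_t j = 0; j < k; j++)
        y[j] = dot(n, v[j], x);
}

/* Copy the k vectors into the columns of a column-major n x k matrix A
   (leading dimension n). This copy is the data movement overhead. */
void pack_columns(size_t n, size_t k, const double *const *v, double *A)
{
    for (size_t j = 0; j < k; j++)
        for (size_t i = 0; i < n; i++)
            A[j * n + i] = v[j][i];
}

/* y = A^T x for a column-major n x k matrix A. This loop is a stand-in
   (an assumption of this sketch) for a BLAS dgemv call with trans = 'T',
   alpha = 1 and beta = 0. */
void gemv_transposed(size_t n, size_t k, const double *A,
                     const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t j = 0; j < k; j++) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += A[j * n + i] * x[i];
        y[j] = s;
    }
}

/* Small made-up example: runs both variants on 3 vectors of length 4,
   stores the matrix-based result in y_gemv and returns the largest
   absolute difference between the two results. */
double mdot_demo(double y_gemv[3])
{
    enum { N = 4, K = 3 };
    const double v0[N] = {1.0, 2.0, 3.0, 4.0};
    const double v1[N] = {0.0, 1.0, 0.0, 1.0};
    const double v2[N] = {2.0, 0.0, -1.0, 0.5};
    const double x[N]  = {1.0, 1.0, 2.0, 2.0};
    const double *v[K] = {v0, v1, v2};
    double A[N * K], y_sep[K], max_diff = 0.0;

    mdot_separate(N, K, v, x, y_sep);
    pack_columns(N, K, v, A);
    gemv_transposed(N, K, A, x, y_gemv);

    for (size_t j = 0; j < K; j++) {
        double d = y_sep[j] - y_gemv[j];
        if (d < 0.0) d = -d;
        if (d > max_diff) max_diff = d;
    }
    return max_diff;
}
```

Calling mdot_demo from a small main is enough to check that both variants agree. In dgemv terms the matrix step corresponds to trans = 'T', m = n (the vector length), n = k (the number of vectors), lda = n, alpha = 1 and beta = 0. The sketch does not model GPU memory transfers or timing, so it says nothing about the speedup itself.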