It seems that VecMDot_SeqCUSP performs rather poorly. This has a large impact 
on SLEPc because VecMDot is the main kernel used in the orthogonalization of 
vectors.

Is this due to the version of Thrust? I am using CUDA Toolkit 4.0.

I tried a naive replacement that copies the contents of the vectors into a 
matrix and calls CUBLAS dgemv. The improvement is significant, despite the data 
movement overhead. In some tests the time reported for VecReduceArith drops 
from 24.5 seconds to 9.6 seconds (with up to 200 vectors of length 10000) on a 
Fermi GPU.
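
To make the idea concrete, here is a minimal CPU-only sketch in plain C of why 
a single dgemv can replace the separate dot products. This is not the actual 
replacement code: the helper names pack_columns and gemv_transposed are 
hypothetical, and the sketch assumes real scalars and column-major storage 
with leading dimension n. On the GPU, the second function would correspond to 
one CUBLAS dgemv call with the transpose option, alpha = 1 and beta = 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy nv separate vectors of length n into the
   columns of a column-major n-by-nv matrix Y (leading dimension n).
   This is the "data movement overhead" mentioned above. */
static void pack_columns(size_t n, size_t nv,
                         const double *const *vecs, double *Y)
{
    assert(vecs != NULL && Y != NULL);
    for (size_t j = 0; j < nv; ++j)
        for (size_t i = 0; i < n; ++i)
            Y[j * n + i] = vecs[j][i];
}

/* Reference version of what a transposed dgemv computes with alpha = 1
   and beta = 0: out = Y^T x, so out[j] is the dot product of column j
   of Y with x. All nv dot products come out of a single call. */
static void gemv_transposed(size_t n, size_t nv,
                            const double *Y, const double *x, double *out)
{
    assert(Y != NULL && x != NULL && out != NULL);
    for (size_t j = 0; j < nv; ++j) {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i)
            sum += Y[j * n + i] * x[i];
        out[j] = sum;
    }
}
```

With complex scalars the conjugate transpose would be needed instead, which 
this sketch does not cover.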

I can send the code for you to try.

Jose

