Hi Paul,

> I agree. Just thought I would share something that works easily now for
> the small nv case.

It still serves as a nice comparison for benchmarking :-)

Best regards,
Karli


>> thanks, this should work well for nv = 1, 2, maybe 3. However, it
>> won't help Jose a lot. Clearly, for nv >> 1, there are a bunch of
>> unnecessary loads of xarray. Thus, a two-kernel approach is necessary
>> to handle both extremes (nv \approx 1 and nv >> 1).
>>
>> Best regards,
>> Karli
>>
>>>> Hi Jose,
>>>>
>>>>>> Since I just stumbled over VecMDot_SeqCUSP() when interfacing
>>>>>> ViennaCL: Do you know what was the reason why the 'old' version was
>>>>>> replaced by this expensive call to gemv() including the creation of
>>>>>> temporaries, etc.? Just writing a custom kernel with one work group
>>>>>> per dot-product should do the job perfectly, shouldn't it?
>>>>>
>>>>> My fault:
>>>>> https://bitbucket.org/petsc/petsc-hg/commits/ec7a7de2acd477e5edd24cc5a3af441ce7a68a36
>>>>>
>>>>> The motivation was that the previous version was even worse for me
>>>>> (VecMDot is used a lot in SLEPc and GPU performance was really bad).
>>>>> At that time I did not have the time to write a custom kernel. If you
>>>>> write one, I could help in testing and measuring performance.
>>>>
>>>> Thanks for providing the context. It makes sense to me now, because
>>>> for eigenvalue computations you typically have a lot more vectors
>>>> taking part in mdot as compared to GMRES. This looks like an
>>>> archetypal example for using two different kernels: The first is
>>>> suitable for 'small' numbers of vectors (GMRES), while the second is
>>>> more gemv-like and good for larger vector counts (SLEPc). I'll let you
>>>> know as soon as it's ready for testing.
>>>>
>>>> Thanks and best regards,
>>>> Karli
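For reference, a minimal CUDA sketch of the "one work group per
dot-product" idea discussed above. This is not the actual PETSc or
ViennaCL code: the kernel name, the power-of-two block size, and the
assumption that the nv y-vectors are packed contiguously as y[v*n + i]
are all illustrative choices.

/* One block per dot product: block v computes x . y_v with a
 * shared-memory tree reduction.  Assumes blockDim.x is a power of
 * two (<= 256) and that the nv y-vectors are stored contiguously,
 * y_v[i] == y[v*n + i] -- illustrative layout, not PETSc's. */
__global__ void mdot_small_nv(const double *x, const double *y,
                              double *results, int n)
{
    __shared__ double sdata[256];
    const double *yv = y + (size_t)blockIdx.x * n;

    double sum = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sum += x[i] * yv[i];          /* every block re-reads x */
    sdata[threadIdx.x] = sum;
    __syncthreads();

    /* tree reduction within the block */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        results[blockIdx.x] = sdata[0];
}

/* launch: one block per dot product, e.g.
 * mdot_small_nv<<<nv, 256>>>(d_x, d_y, d_results, n); */

Note that every block re-reads all of x from global memory, which is
exactly the redundant xarray traffic Karli points out for nv >> 1.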

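And a sketch of the direction the second, more gemv-like kernel could
take: process the y-vectors in fixed-size groups (four here) so that
each load of x[i] is reused across the whole group. Again purely
illustrative -- the name, the group size, and the per-block partials
finished by a trivial second reduction (or on the host) are
assumptions, not anything from the repository.

/* Processes four y-vectors per pass, so each x[i] is loaded once and
 * used four times; a driver loop over v = 0, 4, 8, ... covers general
 * nv, and the 1-3 left-over vectors can fall back to the small-nv
 * kernel.  Assumes blockDim.x == 128 (power of two). */
__global__ void mdot4_partials(const double *x,
                               const double *y0, const double *y1,
                               const double *y2, const double *y3,
                               double *partial, int n)
{
    __shared__ double s[4][128];
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        const double xi = x[i];       /* one x load, four uses */
        a0 += xi * y0[i];
        a1 += xi * y1[i];
        a2 += xi * y2[i];
        a3 += xi * y3[i];
    }
    s[0][threadIdx.x] = a0; s[1][threadIdx.x] = a1;
    s[2][threadIdx.x] = a2; s[3][threadIdx.x] = a3;
    __syncthreads();

    /* reduce all four accumulators in one tree */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            for (int k = 0; k < 4; ++k)
                s[k][threadIdx.x] += s[k][threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        for (int k = 0; k < 4; ++k)
            partial[k * gridDim.x + blockIdx.x] = s[k][0];
}

/* launch: mdot4_partials<<<64, 128>>>(d_x, d_y0, d_y1, d_y2, d_y3,
 *                                     d_partial, n);
 * then reduce the 64 per-block partials for each vector. */

Dispatching on nv between the two kernels (small nv: one block per dot
product; larger nv: grouped passes like the above) would give the
two-kernel approach described in the thread.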