Hi Rui,

On Mon, Nov 5, 2012 at 2:36 PM, Rui Lopes <[email protected]> wrote:

> I have built a benchmark for my custom dot kernel, pasted below. It only
> outperforms cpu dot for big sizes, expectable in my educated guess.
Yes, this is to be expected for your kernel, especially on slow video
cards. Also, the summation step in your kernel is quite inefficient. Have
a look at the reduction example from the CUDA SDK (or at the sources of
pycuda.gpuarray.sum()). Alternatively, you can specialize the matrixMul
example from the CUDA SDK for your needs.

> When ITERS goes up to 10000, there is a drastic overhead. Is this
> function call overhead?

There is indeed some overhead in the function call, but it is hard to
estimate in your examples, because a single call to your kernel takes
very little time. In addition, the overhead can usually be hidden by
serializing kernel calls into a stream. The rapid drop in performance
with large ITERS may be caused by the garbage collector deciding to do
some work (adding gc.collect() to each of the iterations seems to even
things out).

> Moreover, for big sizes some of the outputs don't match.

This seems to be a consequence of using single-precision floats and
summing numbers of significantly different orders of magnitude. When I
filled the arrays with numpy.random.rand() instead, the GPU and numpy
results were equal (up to the expected difference of order ~1e-7).

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
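P.S. In case it helps, the pairwise scheme that the CUDA SDK reduction
example applies in shared memory looks roughly like this in pure Python
(just a sketch of the algorithm, not PyCUDA code; tree_reduce is a name I
made up for illustration):

```python
def tree_reduce(xs):
    # Pairwise (tree) reduction: halve the active range each step,
    # adding element i and element i + stride, until one value is left.
    # On the GPU, each iteration of the inner loop runs as one thread.
    xs = list(xs)
    n = len(xs)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):
            xs[i] += xs[i + half]
        n = half
    return xs[0]

print(tree_reduce(range(1, 101)))  # -> 5050
```

Besides mapping onto the hardware, this also sums numbers of similar
magnitude together, which keeps the rounding error much lower than a
sequential sum.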
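On the ITERS issue: one way to check whether the collector is to blame is
to keep it out of the timed region entirely. A sketch (timed_loop is a
hypothetical helper, not part of your benchmark):

```python
import gc
import time

def timed_loop(fn, iters):
    # Disable the garbage collector while timing, so that a collection
    # pass cannot land inside the measured region and skew one iteration.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        return time.perf_counter() - t0
    finally:
        if was_enabled:
            gc.enable()
```

If the drop in performance disappears under this wrapper, the collector
was the cause; calling gc.collect() once per iteration, as above, is the
other way to even things out.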
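You can reproduce the precision problem on the CPU alone, without any GPU
involved, by rounding every intermediate result to single precision (here
via struct, since plain Python floats are doubles):

```python
import struct

def f32(x):
    # Round a Python double to the nearest single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

# Accumulate 1000 ones into a large float32 value. Each addend is
# smaller than half an ulp of the accumulator (the ulp at 1e8 is 8),
# so every addition rounds straight back to 1e8 and the ones are lost.
acc = f32(1e8)
for _ in range(1000):
    acc = f32(acc + 1.0)

print(acc)         # -> 100000000.0 (the 1000 ones vanished)
print(1e8 + 1000)  # -> 100001000.0 (double-precision reference)
```

This is exactly what happens in a sequential float32 dot product once the
partial sum grows much larger than the individual products.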
