Hi Rui,

On Mon, Nov 5, 2012 at 2:36 PM, Rui Lopes <[email protected]> wrote:

> I have built a benchmark for my custom dot kernel, pasted below. It only
> outperforms cpu dot for big sizes, expectable in my educated guess.
Yes, this is to be expected for your kernel, especially on slow video
cards. Also, the summation step in your kernel is quite inefficient. Have
a look at the reduction example from the CUDA SDK (or at the sources of
pycuda.gpuarray.sum()). Alternatively, you can specialize the matrixMul
example from the CUDA SDK for your needs.

> When ITERS goes up to 10000, there is a drastic overhead. Is this
> function call overhead?

There is indeed some overhead in the function call, but it is hard to
estimate in your examples, because a single call to your kernel takes
very little time. In addition, the overhead can usually be hidden by
serializing kernel calls into a stream. The rapid drop in performance
with large ITERS may be caused by the garbage collector deciding to do
some work (adding gc.collect() to each of the iterations seems to even
things out).

> Moreover, for big sizes some of the outputs don't match.

This seems to be a consequence of using single-precision floats and
summing numbers of significantly different orders of magnitude. When I
filled the arrays with numpy.random.rand() instead, the GPU and numpy
results were equal (up to the expected difference of order ~1e-7).

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
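P.S. In case it helps, the pairwise scheme that the CUDA SDK reduction
example applies in shared memory looks roughly like this in pure Python
(just a sketch of the algorithm, not PyCUDA code; tree_reduce is a name I
made up for illustration):

```python
def tree_reduce(xs):
    # Pairwise (tree) reduction: halve the active range each step,
    # adding element i and element i + stride, until one value is left.
    # On the GPU, each iteration of the inner loop runs as one thread.
    xs = list(xs)
    n = len(xs)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):
            xs[i] += xs[i + half]
        n = half
    return xs[0]

print(tree_reduce(range(1, 101)))  # -> 5050
```

Besides mapping onto the hardware, this also sums numbers of similar
magnitude together, which keeps the rounding error much lower than a
sequential sum.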
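On the ITERS issue: one way to check whether the collector is to blame is
to keep it out of the timed region entirely. A sketch (timed_loop is a
hypothetical helper, not part of your benchmark):

```python
import gc
import time

def timed_loop(fn, iters):
    # Disable the garbage collector while timing, so that a collection
    # pass cannot land inside the measured region and skew one iteration.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        return time.perf_counter() - t0
    finally:
        if was_enabled:
            gc.enable()
```

If the drop in performance disappears under this wrapper, the collector
was the cause; calling gc.collect() once per iteration, as above, is the
other way to even things out.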
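You can reproduce the precision problem on the CPU alone, without any GPU
involved, by rounding every intermediate result to single precision (here
via struct, since plain Python floats are doubles):

```python
import struct

def f32(x):
    # Round a Python double to the nearest single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

# Accumulate 1000 ones into a large float32 value. Each addend is
# smaller than half an ulp of the accumulator (the ulp at 1e8 is 8),
# so every addition rounds straight back to 1e8 and the ones are lost.
acc = f32(1e8)
for _ in range(1000):
    acc = f32(acc + 1.0)

print(acc)         # -> 100000000.0 (the 1000 ones vanished)
print(1e8 + 1000)  # -> 100001000.0 (double-precision reference)
```

This is exactly what happens in a sequential float32 dot product once the
partial sum grows much larger than the individual products.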
