Fil Peters <fil.pet...@yandex.com> writes:
> Thanks for the answer. It is a pity that it is not possible to use these 
> functions, especially since it also seems impossible to use the CUBLAS 
> functions in source modules. To use the GPUArray functions in a large 
> loop, one has to prevent the copy back to main memory. Taking the simple 
> speed test example:
>
> ##################
> # GPUArray SECTION
> # The result is copied back to main memory on each iteration, which is a 
> # bottleneck
> ....
> ....
> for i in range(n_iter):
>     a_gpu = pycuda.cumath.sin(a_gpu)
> end.record() # end timing
> .....
> ....
>
> Would it be possible to not copy the result back to main memory? (I also 
> do not see why the result needs to be copied back at all; it seems more 
> logical to me to copy only when you ask for it.)

If you're talking about the memory bandwidth hit you're taking from the
global loads and stores, that's easy to solve: investigate ElementwiseKernel
in PyCUDA. It lets you merge multiple operations so that only one
fetch/store cycle is needed.

HTH,
Andreas


_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
