Fil Peters <fil.pet...@yandex.com> writes:

> Thanks for the answer; it is a pity that it is not possible to use these
> functions, especially since it also seems not possible to use the cuBLAS
> functions in the source modules. In order to be able to use the GPU array
> functions in a large loop, one has to prevent the copy to main memory. To
> take the simple speed-test example:
>
> ##################
> # GPUArray SECTION
> # The result is copied back to main memory on each iteration; this is a
> # bottleneck
> ....
> ....
> for i in range(n_iter):
>     a_gpu = pycuda.cumath.sin(a_gpu)
> end.record()  # end timing
> .....
> ....
>
> Would it be possible to not copy the result to main memory? (I also do not
> see why the result needs to be copied back to main memory; it seems more
> logical to me to copy only when you ask for it.)
If you're talking about the memory bandwidth hit you're taking from the global load/store, that's easy to solve: investigate ElementwiseKernel in PyCUDA. It lets you merge multiple operations into a single kernel, so that only one fetch/store cycle is needed. HTH, Andreas
_______________________________________________ PyCUDA mailing list PyCUDA@tiker.net http://lists.tiker.net/listinfo/pycuda