On Thursday 05 February 2009, J-Pascal Mercier wrote:
> Hi,
>
> I have a kernel that is invoked in a loop with data calculated in the
> previous kernel iteration. The kernel uses textures as input data. Right
> now, I use the Memcpy2D/3D functions to copy the resulting GPUArray back
> to a texture, but unfortunately this operation is very slow. I have only
> been able to achieve 3-4 GB/s, which is far below the 50-60 GB/s I
> can achieve in C with the function cudaMemcpyToArray, which unfortunately
> is part of the Runtime API. My guess is that the problem comes from the
> parameters of Memcpy2D/3D, but I can't find the right ones to speed up
> the process. The function looks like :
Odd--that sounds like the data is actually crossing the PCIe bus, which would be less than useful. I have a suspicion: your memory pitch is off. The manpage for cuMemAllocPitch says this:

    The pitch returned by cuMemAllocPitch() is guaranteed to work with
    cuMemcpy2D() under all circumstances. For allocations of 2D arrays, it
    is recommended that programmers consider performing pitch allocations
    using cuMemAllocPitch(). Due to alignment restrictions in the hardware,
    this is especially true if the application will be performing 2D memory
    copies between different regions of device memory (whether linear
    memory or CUDA arrays).

That reveals a small deficiency in PyCuda: there needs to be a way to allocate GPUArrays such that cuMemAllocPitch is used for the allocation. I'll look into that (but if you're willing to cook up a patch, that wouldn't hurt, either). In the meantime, can you check (using just pycuda.driver.mem_alloc_pitch) whether that fixes it?

Andreas
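The suggestion above can be sketched roughly as follows. This is an illustrative example, not code from the thread: the helper aligned_pitch only mimics the row padding a pitched allocator performs (the 512-byte alignment is an assumed example value -- the authoritative pitch is whatever mem_alloc_pitch actually returns), and the GPU portion assumes PyCuda's driver-API wrappers mem_alloc_pitch, ArrayDescriptor, Array, and Memcpy2D, plus a working CUDA device.

```python
# Illustrative sketch: allocate pitched linear memory with
# mem_alloc_pitch, then do a device->array Memcpy2D using the pitch the
# driver returned, so the copy can stay on the fast on-device path.

def aligned_pitch(width_in_bytes, alignment=512):
    """Round a row width up to an alignment boundary, as a pitched
    allocator does internally. The 512-byte default is an assumed
    example value, not a documented constant."""
    return -(-width_in_bytes // alignment) * alignment

if __name__ == "__main__":
    try:
        # Skip the GPU part when PyCuda or a CUDA device is unavailable.
        import pycuda.autoinit  # creates a context as a side effect
        import pycuda.driver as drv
    except Exception:
        drv = None

    if drv is not None:
        width, height = 1024, 1024       # elements, float32
        width_bytes = width * 4

        # Let the driver pick a pitch that satisfies the hardware's
        # alignment restrictions (cf. the cuMemAllocPitch manpage).
        devptr, pitch = drv.mem_alloc_pitch(width_bytes, height,
                                            access_size=4)

        # Destination CUDA array, suitable for texture binding.
        descr = drv.ArrayDescriptor()
        descr.width = width
        descr.height = height
        descr.format = drv.array_format.FLOAT
        descr.num_channels = 1
        ary = drv.Array(descr)

        # 2D copy: pitched linear device memory -> CUDA array.
        copy = drv.Memcpy2D()
        copy.set_src_device(devptr)
        copy.src_pitch = pitch           # use the driver-returned pitch
        copy.set_dst_array(ary)
        copy.width_in_bytes = width_bytes
        copy.height = height
        copy(aligned=True)
```

The key point is that src_pitch is set to the value mem_alloc_pitch returned, rather than to the raw row width of an ordinary mem_alloc allocation.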
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
