On Thursday 05 February 2009, J-Pascal Mercier wrote:
> Hi,
>
> I have a kernel that is invoked in a loop with data calculated in the
> previous kernel iteration. The kernel uses textures as input data. Right
> now, I use the Memcpy2D/3D functions to copy the resulting GPUArray back
> to a texture, but unfortunately this operation is very slow. I have only
> been able to achieve 3-4 GB/s, which is far below the 50-60 GB/s I
> can achieve in C with the function cudaMemcpyToArray, which unfortunately
> is part of the Runtime API. My guess is that the problem comes from the
> parameters of Memcpy2D/3D, but I can't find the right ones to speed up
> the process. The function looks like :
Odd--that sounds like the data is actually crossing the PCIe bus, which would be less than useful. I have a suspicion: your memory pitch is off. The manpage for cuMemAllocPitch says this:

    The pitch returned by cuMemAllocPitch() is guaranteed to work with
    cuMemcpy2D() under all circumstances. For allocations of 2D arrays, it
    is recommended that programmers consider performing pitch allocations
    using cuMemAllocPitch(). Due to alignment restrictions in the hardware,
    this is especially true if the application will be performing 2D memory
    copies between different regions of device memory (whether linear
    memory or CUDA arrays).

That reveals a small deficiency in PyCuda: there needs to be a way to allocate GPUArrays such that cuMemAllocPitch is used for the allocation. I'll look into that (but if you're willing to cook up a patch, that wouldn't hurt, either). In the meantime, can you check (using just pycuda.driver.mem_alloc_pitch) whether that fixes it?

Andreas
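The suggestion above can be sketched roughly as follows. This is an illustrative example, not code from the thread: the helper aligned_pitch only mimics the row padding a pitched allocator performs (the 512-byte alignment is an assumed example value -- the authoritative pitch is whatever mem_alloc_pitch actually returns), and the GPU portion assumes PyCuda's driver-API wrappers mem_alloc_pitch, ArrayDescriptor, Array, and Memcpy2D, plus a working CUDA device.

```python
# Illustrative sketch: allocate pitched linear memory with
# mem_alloc_pitch, then do a device->array Memcpy2D using the pitch the
# driver returned, so the copy can stay on the fast on-device path.

def aligned_pitch(width_in_bytes, alignment=512):
    """Round a row width up to an alignment boundary, as a pitched
    allocator does internally. The 512-byte default is an assumed
    example value, not a documented constant."""
    return -(-width_in_bytes // alignment) * alignment

if __name__ == "__main__":
    try:
        # Skip the GPU part when PyCuda or a CUDA device is unavailable.
        import pycuda.autoinit  # creates a context as a side effect
        import pycuda.driver as drv
    except Exception:
        drv = None

    if drv is not None:
        width, height = 1024, 1024       # elements, float32
        width_bytes = width * 4

        # Let the driver pick a pitch that satisfies the hardware's
        # alignment restrictions (cf. the cuMemAllocPitch manpage).
        devptr, pitch = drv.mem_alloc_pitch(width_bytes, height,
                                            access_size=4)

        # Destination CUDA array, suitable for texture binding.
        descr = drv.ArrayDescriptor()
        descr.width = width
        descr.height = height
        descr.format = drv.array_format.FLOAT
        descr.num_channels = 1
        ary = drv.Array(descr)

        # 2D copy: pitched linear device memory -> CUDA array.
        copy = drv.Memcpy2D()
        copy.set_src_device(devptr)
        copy.src_pitch = pitch           # use the driver-returned pitch
        copy.set_dst_array(ary)
        copy.width_in_bytes = width_bytes
        copy.height = height
        copy(aligned=True)
```

The key point is that src_pitch is set to the value mem_alloc_pitch returned, rather than to the raw row width of an ordinary mem_alloc allocation.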
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
