Hi,

I'm looking into using CUDArt but am struggling with a basic issue. 

Following the simple vadd example,

extern "C"
{
    __global__ void vadd(const int n, const double *a, const double *b,
                         double *c)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
        {
            c[i] = a[i] + b[i];
        }
    }
}
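For reference, I compile the kernel to PTX with something like the following (assuming the kernel above is saved as vadd.cu; sm_32 is the Jetson TK1's compute capability):

```shell
# Compile the kernel to PTX for the TK1's Kepler GPU (compute capability 3.2)
nvcc -ptx -arch=sm_32 vadd.cu -o vadd.ptx
```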

I wrote Julia code to repeatedly add two matrices:

using CUDArt, PyPlot

CUDArt.init([0])

md = CuModule("vadd.ptx", false)
kernel = CuFunction(md, "vadd")

function vadd(A::CudaArray, B::CudaArray, C::CudaArray)
    nblocks = round(Int, ceil(length(A)/1024))
    launch(kernel, nblocks, 1024, (length(A), A, B, C))
end

N = 2000
A = CudaArray(rand(N,N))
B = CudaArray(rand(N,N))
C = CudaArray(zeros(N,N))

M = 2000
tm = zeros(M)

for i in 1:M
    #if i == 1000 gc() end
    tic()
    vadd(A, B, C)
    tm[i] = toq()   # toq() returns the elapsed time without printing it
end
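One thing I'm unsure about: launch() returns as soon as the kernel is enqueued, so the loop above may only be timing the launch overhead rather than the addition itself. A variant that waits for the GPU before stopping the timer (assuming CUDArt exports device_synchronize(), which I believe wraps cudaDeviceSynchronize) would be:

```julia
# Time the full kernel execution, not just the asynchronous launch
for i in 1:M
    tic()
    vadd(A, B, C)
    device_synchronize()   # block until the kernel has finished
    tm[i] = toq()          # elapsed seconds, without printing
end
```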

plot(tm[10:end])

The addition of the two matrices runs very fast for about 1000 iterations, 
but then slows dramatically (see nogc.png). However, if I call gc() after 
1000 iterations (this single gc step takes a few seconds to run), things 
run quickly again (see withgc.png).

Is there any way to avoid having to call gc() manually?
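For context, the CUDArt README shows a do-block form of devices() that is supposed to clean up GPU resources automatically when the block exits; a minimal sketch of wrapping my code in it (untested on my part) would look like:

```julia
using CUDArt

# Sketch based on the CUDArt README's do-block pattern; resources allocated
# inside the block should be released when the block exits
devices(dev -> capability(dev)[1] >= 2) do devlist
    md = CuModule("vadd.ptx", false)
    kernel = CuFunction(md, "vadd")
    # ... allocate the CudaArrays and run the timing loop here ...
end
```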

I'm also confused about where the (presumed) memory leak could be coming 
from. Much of my work involves iterative algorithms, so I really need to 
figure this out before adopting this otherwise awesome tool.

I'm running this on a Jetson TK1.  I've tried several different nvcc 
compile options and all exhibit the same behaviour.

Many thanks,

David
