Ah, I see -- a newbie error!
Adding a CUDArt.device_synchronize() call after the loop shows that indeed
it's just the GPU catching up.
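For anyone landing on this thread later, a minimal sketch of the corrected timing loop (same names as in the code quoted below; synchronizing inside the loop gives a per-iteration kernel time rather than just the enqueue cost):

```julia
for i in 1:M
    tic()
    vadd(A, B, C)
    CUDArt.device_synchronize()  # block until the kernel actually finishes
    tm[i] = toc()
end
```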
Many thanks Tim (big fan of your work).
David
On Saturday, November 28, 2015 at 8:28:12 PM UTC, Tim Holy wrote:
>
> Are you sure you're not being fooled by asynchronous operations? launch
> doesn't wait for the kernel to finish before returning. Since gc is a slow
> call, you're presumably just giving the GPU time to catch up on its queue,
> which creates the opportunity to schedule more operations.
>
> In other words, you might observe the same phenomenon with sleep(0.1) that
> you're seeing with gc().
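>
> To make the asynchrony visible, compare the enqueue time against a
> synchronized time; a rough sketch using the names from your code (the
> actual numbers will depend on the hardware):
>
> ```julia
> # Enqueueing a launch returns almost immediately...
> t_enqueue = @elapsed vadd(A, B, C)
>
> # ...while a synchronized timing includes the kernel itself.
> t_real = @elapsed begin
>     vadd(A, B, C)
>     CUDArt.device_synchronize()
> end
> ```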
>
> --Tim
>
> On Saturday, November 28, 2015 08:18:25 AM davidino wrote:
> > Hi
> >
> > I'm looking into using CUDArt but am struggling with a basic issue.
> >
> > Following the simple vadd example,
> >
> > extern "C"
> > {
> >     __global__ void vadd(const int n, const double *a, const double *b,
> >                          double *c)
> >     {
> >         int i = threadIdx.x + blockIdx.x * blockDim.x;
> >         if (i < n)
> >         {
> >             c[i] = a[i] + b[i];
> >         }
> >     }
> > }
> >
> > I wrote julia code to repeatedly add two matrices:
> >
> > using CUDArt, PyPlot
> >
> > CUDArt.init([0])
> >
> > md=CuModule("vadd.ptx",false)
> > kernel=CuFunction(md,"vadd")
> >
> > function vadd(A::CudaArray, B::CudaArray, C::CudaArray)
> >     nblocks = round(Int, ceil(length(B)/1024))
> >     launch(kernel, nblocks, 1024, (length(A), A, B, C))
> > end
> >
> > N=2000
> > A=CudaArray(rand(N,N))
> > B=CudaArray(rand(N,N))
> > C=CudaArray(zeros(N,N))
> >
> > M=2000
> > tm=zeros(M)
> >
> > for i in 1:M
> >     # if i == 1000 gc() end
> >     tic()
> >     vadd(A, B, C)
> >     tm[i] = toc()
> > end
> >
> > plot(tm[10:end])
> >
> > The addition of the two matrices goes super fast for about 1000
> > iterations, but then dramatically slows (see nogc.png). However, if I
> > call gc() after 1000 iterations (this single gc step takes a few
> > seconds to run) then things run quickly again (see withgc.png).
> >
> > Is there any way to avoid having to manually add gc()?
> >
> > I'm also confused about where the (presumed) memory leak could be
> > coming from. Much of my work involves iterative algorithms, so I
> > really need to figure this out before using this otherwise awesome
> > tool.
> >
> > I'm running this on a Jetson TK1. I've tried several different nvcc
> > compile options and all exhibit the same behaviour.
> >
> > Many thanks,
> >
> > David