I think the solution was to do something like: /dev/nvidia > /dev/null, not sure though
This seems relevant:
http://reference.wolfram.com/mathematica/CUDALink/tutorial/Headless.html
http://blog.njoubert.com/2010/10/running-cuda-without-running-x.html
They also suggested running nvidia-smi in persistent mode, which could either be the solution to your problem, or what you have been doing so far.

Apostolis

2012/7/31 Leandro Demarco Vedelago <[email protected]>
> Ok, I think you found the source of my problem, Apostolis.
>
> I profiled the execution both on the server and on the laptop, and the calls to memcpy with non-pinned memory were considerably faster on the server's Tesla than on the laptop's GT 540M, while pinned memory transfers took about the same time on both.
>
> From your previous email, I decided to take a look at the (py)cuda initialization. So I removed the pycuda.autoinit import and did the initialization "by hand" to perform some rustic time-measuring. I added the following lines at the start of the benchmark() function:
>
> print "Starting initialization"
> cuda.init()
> dev = cuda.Device(0)
> ctx = dev.make_context()
> print "Initialization finished"
>
> So I ran this modified code, and on the laptop it executed pretty fast, with less than a second elapsed between the two prints. But when I ran it on the server, there were about 10 seconds between the first print and the last one.
>
> Upon receiving your last e-mail I ran nvidia-smi first and then the program, with no changes. But then I tried leaving nvidia-smi looping with the -l argument and running the program on another tty and, to my surprise, it ran in a little less than 2 seconds, against those nearly 15 when nvidia-smi isn't looping/executing.
> This is still slower than the laptop, but this particular code is not optimized for multi-GPU, and there could be other factors like the communication latency over the PCI bus (which they told me on this list is sometimes lower on laptops) and the fact that I am executing remotely via ssh.
>
> As for what you told me about mounting /dev/nvidia, I had to do that previously, because as I didn't install a GUI they wouldn't mount at boot time and therefore CUDA programs would not detect the devices (I had this problem after finishing the CUDA installation and running the deviceQuery example from the SDK, which gave me a "no CUDA-capable devices found" error).
>
> Any further ideas on why running nvidia-smi at the same time boosts initialization so much? You've been really helpful and I really appreciate your help, even if you cannot help me any more (I'll just have to wait until those damned Nvidia forums come back :) )
>
> On Tue, Jul 31, 2012 at 1:52 PM, Apostolis Glenis <[email protected]> wrote:
> > I think it is the same case.
> > The NVIDIA driver is initialized when X-windows starts or at the first execution of a GPU program.
> > Could you try running nvidia-smi first and then your program?
> > I have read somewhere (I think in the thrust-users mailing list) that you have to load /dev/nvidia first or something like that.
> > The closest thing I could find was this:
> > http://www.gpugrid.net/forum_thread.php?id=266
> >
> >
> > 2012/7/31 Leandro Demarco Vedelago <[email protected]>
> >>
> >> Apostolis, I'm not using X windows, as I did not install any GUI on the server.
> >>
> >> On Tue, Jul 31, 2012 at 11:46 AM, Apostolis Glenis <[email protected]> wrote:
> >> > Maybe it has to do with the initialization of the GPU if another GPU is responsible for X windows.
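For reference, a minimal standalone version of the by-hand initialization timing described above could look like the sketch below. The time.time() wall-clock measurement is added here purely for illustration; it is not part of Leandro's original snippet.

    import time
    import pycuda.driver as cuda

    print "Starting initialization"
    t0 = time.time()
    cuda.init()               # initialize the CUDA driver API
    dev = cuda.Device(0)      # first visible GPU
    ctx = dev.make_context()  # create (and push) a context on that device
    print "Initialization finished in %.2f s" % (time.time() - t0)

    # ... run the actual benchmark here ...

    ctx.pop()                 # detach the context when done

If the extra ~10 seconds really is driver initialization, the usual headless workaround (as in the links above) is to keep the driver loaded between runs, for example by enabling persistence mode with "nvidia-smi -pm 1" (run as root), rather than leaving nvidia-smi -l looping on another tty.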
> >> >
> >> >
> >> > 2012/7/31 Leandro Demarco Vedelago <[email protected]>
> >> >>
> >> >> Just to add a concrete and simple example that I guess will clarify my situation. The following code creates two buffers on the host side, one page-locked and the other a common one, and then copies/writes to a GPU buffer and evaluates performance using events for time measuring.
> >> >> It's really simple indeed, there's no execution on multiple GPUs, but I would expect it to run in more or less the same time on the server using just one of the Teslas.
> >> >> However, it takes less than a second to run on my laptop and nearly 15 seconds on the server!!!
> >> >>
> >> >> import pycuda.driver as cuda
> >> >> import pycuda.autoinit
> >> >> import numpy as np
> >> >>
> >> >> def benchmark(up):
> >> >>     """ up is a boolean flag. If set to True, the benchmark is run
> >> >>     copying from host to device; if False, the benchmark is run
> >> >>     the other way round
> >> >>     """
> >> >>
> >> >>     # Buffers size
> >> >>     size = 10*1024*1024
> >> >>
> >> >>     # Host and device buffers, equally shaped. We don't care about their contents
> >> >>     cpu_buff = np.empty(size, np.dtype('u1'))
> >> >>     cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
> >> >>     gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)
> >> >>
> >> >>     # Events for measuring execution time; first two for the non-pinned
> >> >>     # buffer, last two for the pinned (locked) buffer
> >> >>     startn = cuda.Event()
> >> >>     endn = cuda.Event()
> >> >>     startl = cuda.Event()
> >> >>     endl = cuda.Event()
> >> >>
> >> >>     if (up):
> >> >>         startn.record()
> >> >>         cuda.memcpy_htod(gpu_buff, cpu_buff)
> >> >>         endn.record()
> >> >>         endn.synchronize()
> >> >>         t1 = endn.time_since(startn)
> >> >>
> >> >>         startl.record()
> >> >>         cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
> >> >>         endl.record()
> >> >>         endl.synchronize()
> >> >>         t2 = endl.time_since(startl)
> >> >>
> >> >>         print "From host to device benchmark results: \n"
> >> >>         print "Time for copying from normal host mem: %i ms\n" % t1
> >> >>         print "Time for copying from pinned host mem: %i ms\n" % t2
> >> >>
> >> >>         diff = t1-t2
> >> >>         if (diff > 0):
> >> >>             print "Copy from pinned memory was %i ms faster\n" % diff
> >> >>         else:
> >> >>             print "Copy from pinned memory was %i ms slower\n" % diff
> >> >>
> >> >>     else:
> >> >>         startn.record()
> >> >>         cuda.memcpy_dtoh(cpu_buff, gpu_buff)
> >> >>         endn.record()
> >> >>         endn.synchronize()
> >> >>         t1 = endn.time_since(startn)
> >> >>
> >> >>         startl.record()
> >> >>         cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
> >> >>         endl.record()
> >> >>         endl.synchronize()
> >> >>         t2 = endl.time_since(startl)
> >> >>
> >> >>         print "From device to host benchmark results: \n"
> >> >>         print "Time for copying to normal host mem: %i ms\n" % t1
> >> >>         print "Time for copying to pinned host mem: %i ms\n" % t2
> >> >>
> >> >>         diff = t1-t2
> >> >>         if (diff > 0):
> >> >>             print "Copy to pinned memory was %i ms faster\n" % diff
> >> >>         else:
> >> >>             print "Copy to pinned memory was %i ms slower\n" % diff
> >> >>
> >> >> benchmark(up=False)
> >> >>
> >> >>
> >> >> On Mon, Jul 30, 2012 at 3:22 PM, Leandro Demarco Vedelago <[email protected]> wrote:
> >> >> > ---------- Forwarded message ----------
> >> >> > From: Leandro Demarco Vedelago <[email protected]>
> >> >> > Date: Mon, Jul 30, 2012 at 2:57 PM
> >> >> > Subject: Re: [PyCUDA] Performance Issues
> >> >> > To: Brendan Wood <[email protected]>, [email protected]
> >> >> >
> >> >> >
> >> >> > Brendan:
> >> >> > Basically, all the examples are computing the dot product of 2 large vectors, but in each example some new concept is introduced (pinned memory, streams, etc.).
> >> >> > The last example is the one that incorporates multiple GPUs.
> >> >> >
> >> >> > As for the work done, I am generating the data randomly and making some tests at the end on the host side, which considerably increases execution time, but as these are "learning examples" I was not especially worried about it. However, I would have expected that, given that the server has far more powerful hardware (the 3 Tesla C2075s, 4 Intel Xeons with 6 cores each and 48 GB of RAM), programs would run faster, in particular this last example, which is designed to work with multiple GPUs.
> >> >> >
> >> >> > I compiled and ran the bandwidthTest and deviceQuery samples from the SDK and they both passed, if that is what you meant.
> >> >> >
> >> >> > Now answering Andreas:
> >> >> > Yes, I'm using one thread per GPU (the way it's done in the wiki example) and yes, the server has way more than 3 CPUs. As for the SCHED_BLOCKING_SYNC flag, should I pass it as an argument for each device context? What does this flag do?
> >> >> >
> >> >> > Thank you both for your answers.
> >> >> >
> >> >> > On Mon, Jul 30, 2012 at 12:47 AM, Brendan Wood <[email protected]> wrote:
> >> >> >> Hi Leandro,
> >> >> >>
> >> >> >> Without knowing exactly what examples you're running, it may be hard to say what the problem is. In fact, you may not really have a problem.
> >> >> >>
> >> >> >> How much work is being done in each example program? Is it enough to really work the GPU, or is communication and other overhead dominating runtime? Note that laptops may have lower communication latency over the PCI bus than desktops/servers, which can make small programs run much faster on laptops regardless of how much processing power the GPU has.
> >> >> >>
> >> >> >> Have you tried running the sample code from the SDK, so that you can verify that it's not a code problem?
> >> >> >>
> >> >> >> Regards,
> >> >> >>
> >> >> >> Brendan Wood
> >> >> >>
> >> >> >>
> >> >> >> On Sun, 2012-07-29 at 23:59 -0300, Leandro Demarco Vedelago wrote:
> >> >> >>> Hello: I've been reading and learning CUDA in the last few weeks, and last week I started writing (translating to PyCUDA from CUDA C) some examples taken from the book "CUDA by Example".
> >> >> >>> I started coding on a laptop with just one NVIDIA GPU (a GTX 560M, if my memory is right) with Windows 7.
> >> >> >>>
> >> >> >>> But in the project I'm currently working on, we intend to run (py)cuda on a multi-GPU server that has three Tesla C2075 cards.
> >> >> >>>
> >> >> >>> So I installed Ubuntu Server 10.10 (with no GUI) and managed to install and get running the very same examples I ran on the single-GPU laptop.
> >> >>> However, they run really slowly; in some cases they take 3 times longer than on the laptop. And this happens with most, if not all, of the examples I wrote.
> >> >>>
> >> >>> I thought it could be a driver issue, but I double-checked and I've installed the correct ones, meaning those listed in the CUDA Zone section of nvidia.com for Linux 64-bit. So I'm kind of lost right now and was wondering if anyone has had this or a somewhat similar problem running on a server.
> >> >>>
> >> >>> Sorry for the English, but it's not my native language.
> >> >>>
> >> >>> Thanks in advance, Leandro Demarco
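On the SCHED_BLOCKING_SYNC question raised above: in PyCUDA the context scheduling flags are exposed through pycuda.driver.ctx_flags and can be passed to Device.make_context(). A rough sketch of what that might look like, assuming the flag is available in the PyCUDA build in use (not verified against the exact versions in this thread):

    import pycuda.driver as cuda

    cuda.init()
    dev = cuda.Device(0)
    # SCHED_BLOCKING_SYNC tells the driver to block the host thread on a
    # synchronization primitive while waiting for the GPU, instead of
    # spin-waiting; with one host thread per GPU this keeps the CPUs free.
    ctx = dev.make_context(flags=cuda.ctx_flags.SCHED_BLOCKING_SYNC)

    # ... per-GPU work goes here ...

    ctx.pop()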
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
