I think the solution was to do something like: /dev/nvidia > /dev/null, not sure though
This seems relevant:
http://reference.wolfram.com/mathematica/CUDALink/tutorial/Headless.html
http://blog.njoubert.com/2010/10/running-cuda-without-running-x.html
They also suggested running nvidia-smi in persistent mode, which could either be the solution to your problem, or what you have been doing so far.

Apostolis

2012/7/31 Leandro Demarco Vedelago <[email protected]>
> Ok, I think you found the source of my problem, Apostolis.
>
> I profiled the execution both on the server and on the laptop, and the calls to memcpy with non-pinned memory were considerably faster on the server's Tesla than on the laptop's GT 540M, while pinned memory transfers took about the same time on both.
>
> From your previous email, I decided to take a look at the (py)cuda initialization. So I removed the pycuda.autoinit import and did the initialization "by hand" to perform some rustic time-measuring. I added the following lines at the start of the benchmark() function:
>
> print "Starting initialization"
> cuda.init()
> dev = cuda.Device(0)
> ctx = dev.make_context()
> print "Initialization finished"
>
> So I ran this modified code, and on the laptop it executed pretty fast, with less than a second elapsed between the two prints. But when I ran it on the server, there were about 10 seconds between the first print and the last one.
>
> Upon receiving your last e-mail I ran nvidia-smi first and then the program, with no changes. But then I tried leaving nvidia-smi looping with the -l argument and running the program on another tty and, to my surprise, it ran in a little less than 2 seconds, against those nearly 15 when nvidia-smi isn't looping/executing.
> This is still slower than the laptop, but this particular code is not optimized for multi-GPU, and there could be other factors like the communication latency over the PCI bus (which they told me on this list is sometimes lower on laptops) and the fact that I am executing remotely via ssh.
>
> As for what you told me about mounting /dev/nvidia, I had to do that previously, because as I didn't install a GUI they wouldn't mount at boot time and therefore CUDA programs would not detect the devices (I had this problem after finishing the CUDA installation and running the deviceQuery example from the SDK, which gave me a "no CUDA-capable devices found" error).
>
> Any further ideas on why running nvidia-smi at the same time boosts initialization so much? You've been really helpful and I really appreciate your help, even if you cannot help me any more (I'll just have to wait until those damned Nvidia forums come back :) )
>
> On Tue, Jul 31, 2012 at 1:52 PM, Apostolis Glenis <[email protected]> wrote:
> > I think it is the same case.
> > The NVIDIA driver is initialized when X-windows starts or at the first execution of a GPU program.
> > Could you try running nvidia-smi first and then your program?
> > I have read somewhere (I think in the thrust-users mailing list) that you have to load /dev/nvidia first or something like that.
> > The closest thing I could find was this:
> > http://www.gpugrid.net/forum_thread.php?id=266
> >
> >
> > 2012/7/31 Leandro Demarco Vedelago <[email protected]>
> >>
> >> Apostolis, I'm not using X windows, as I did not install any GUI on the server.
> >>
> >> On Tue, Jul 31, 2012 at 11:46 AM, Apostolis Glenis <[email protected]> wrote:
> >> > Maybe it has to do with the initialization of the GPU if another GPU is responsible for X windows.
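For reference, a minimal standalone version of the by-hand initialization timing described above could look like the sketch below. The time.time() wall-clock measurement is added here purely for illustration; it is not part of Leandro's original snippet.

    import time
    import pycuda.driver as cuda

    print "Starting initialization"
    t0 = time.time()
    cuda.init()               # initialize the CUDA driver API
    dev = cuda.Device(0)      # first visible GPU
    ctx = dev.make_context()  # create (and push) a context on that device
    print "Initialization finished in %.2f s" % (time.time() - t0)

    # ... run the actual benchmark here ...

    ctx.pop()                 # detach the context when done

If the extra ~10 seconds really is driver initialization, the usual headless workaround (as in the links above) is to keep the driver loaded between runs, for example by enabling persistence mode with "nvidia-smi -pm 1" (run as root), rather than leaving nvidia-smi -l looping on another tty.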
> >> >
> >> >
> >> > 2012/7/31 Leandro Demarco Vedelago <[email protected]>
> >> >>
> >> >> Just to add a concrete and simple example that I guess will clarify my situation. The following code creates two buffers on the host side, one page-locked and the other a common one, and then copies/writes to a GPU buffer and evaluates performance using events for time measuring.
> >> >> It's really simple indeed, there's no execution on multiple GPUs, but I would expect it to run in more or less the same time on the server using just one of the Teslas.
> >> >> However, it takes less than a second to run on my laptop and nearly 15 seconds on the server!!!
> >> >>
> >> >> import pycuda.driver as cuda
> >> >> import pycuda.autoinit
> >> >> import numpy as np
> >> >>
> >> >> def benchmark(up):
> >> >>     """ up is a boolean flag. If set to True, the benchmark is run
> >> >>     copying from host to device; if False, the benchmark is run
> >> >>     the other way round
> >> >>     """
> >> >>
> >> >>     # Buffers size
> >> >>     size = 10*1024*1024
> >> >>
> >> >>     # Host and device buffers, equally shaped. We don't care about their contents
> >> >>     cpu_buff = np.empty(size, np.dtype('u1'))
> >> >>     cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
> >> >>     gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)
> >> >>
> >> >>     # Events for measuring execution time; first two for the non-pinned
> >> >>     # buffer, last two for the pinned (locked) buffer
> >> >>     startn = cuda.Event()
> >> >>     endn = cuda.Event()
> >> >>     startl = cuda.Event()
> >> >>     endl = cuda.Event()
> >> >>
> >> >>     if (up):
> >> >>         startn.record()
> >> >>         cuda.memcpy_htod(gpu_buff, cpu_buff)
> >> >>         endn.record()
> >> >>         endn.synchronize()
> >> >>         t1 = endn.time_since(startn)
> >> >>
> >> >>         startl.record()
> >> >>         cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
> >> >>         endl.record()
> >> >>         endl.synchronize()
> >> >>         t2 = endl.time_since(startl)
> >> >>
> >> >>         print "From host to device benchmark results: \n"
> >> >>         print "Time for copying from normal host mem: %i ms\n" % t1
> >> >>         print "Time for copying from pinned host mem: %i ms\n" % t2
> >> >>
> >> >>         diff = t1-t2
> >> >>         if (diff > 0):
> >> >>             print "Copy from pinned memory was %i ms faster\n" % diff
> >> >>         else:
> >> >>             print "Copy from pinned memory was %i ms slower\n" % diff
> >> >>
> >> >>     else:
> >> >>         startn.record()
> >> >>         cuda.memcpy_dtoh(cpu_buff, gpu_buff)
> >> >>         endn.record()
> >> >>         endn.synchronize()
> >> >>         t1 = endn.time_since(startn)
> >> >>
> >> >>         startl.record()
> >> >>         cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
> >> >>         endl.record()
> >> >>         endl.synchronize()
> >> >>         t2 = endl.time_since(startl)
> >> >>
> >> >>         print "From device to host benchmark results: \n"
> >> >>         print "Time for copying to normal host mem: %i ms\n" % t1
> >> >>         print "Time for copying to pinned host mem: %i ms\n" % t2
> >> >>
> >> >>         diff = t1-t2
> >> >>         if (diff > 0):
> >> >>             print "Copy to pinned memory was %i ms faster\n" % diff
> >> >>         else:
> >> >>             print "Copy to pinned memory was %i ms slower\n" % diff
> >> >>
> >> >> benchmark(up=False)
> >> >>
> >> >>
> >> >> On Mon, Jul 30, 2012 at 3:22 PM, Leandro Demarco Vedelago <[email protected]> wrote:
> >> >> > ---------- Forwarded message ----------
> >> >> > From: Leandro Demarco Vedelago <[email protected]>
> >> >> > Date: Mon, Jul 30, 2012 at 2:57 PM
> >> >> > Subject: Re: [PyCUDA] Performance Issues
> >> >> > To: Brendan Wood <[email protected]>, [email protected]
> >> >> >
> >> >> >
> >> >> > Brendan:
> >> >> > Basically, all the examples are computing the dot product of 2 large vectors, but in each example some new concept is introduced (pinned memory, streams, etc.).
> >> >> > The last example is the one that incorporates multiple GPUs.
> >> >> >
> >> >> > As for the work done, I am generating the data randomly and making some tests at the end on the host side, which considerably increases execution time, but as these are "learning examples" I was not especially worried about it. However, I would have expected that, given that the server has far more powerful hardware (the 3 Tesla C2075s, 4 Intel Xeons with 6 cores each and 48 GB of RAM), programs would run faster, in particular this last example, which is designed to work with multiple GPUs.
> >> >> >
> >> >> > I compiled and ran the bandwidthTest and deviceQuery samples from the SDK and they both passed, if that is what you meant.
> >> >> >
> >> >> > Now answering Andreas:
> >> >> > Yes, I'm using one thread per GPU (the way it's done in the wiki example) and yes, the server has way more than 3 CPUs. As for the SCHED_BLOCKING_SYNC flag, should I pass it as an argument for each device context? What does this flag do?
> >> >> >
> >> >> > Thank you both for your answers.
> >> >> >
> >> >> > On Mon, Jul 30, 2012 at 12:47 AM, Brendan Wood <[email protected]> wrote:
> >> >> >> Hi Leandro,
> >> >> >>
> >> >> >> Without knowing exactly what examples you're running, it may be hard to say what the problem is. In fact, you may not really have a problem.
> >> >> >>
> >> >> >> How much work is being done in each example program? Is it enough to really work the GPU, or is communication and other overhead dominating runtime? Note that laptops may have lower communication latency over the PCI bus than desktops/servers, which can make small programs run much faster on laptops regardless of how much processing power the GPU has.
> >> >> >>
> >> >> >> Have you tried running the sample code from the SDK, so that you can verify that it's not a code problem?
> >> >> >>
> >> >> >> Regards,
> >> >> >>
> >> >> >> Brendan Wood
> >> >> >>
> >> >> >>
> >> >> >> On Sun, 2012-07-29 at 23:59 -0300, Leandro Demarco Vedelago wrote:
> >> >> >>> Hello: I've been reading and learning CUDA in the last few weeks, and last week I started writing (translating to PyCUDA from CUDA C) some examples taken from the book "CUDA by Example".
> >> >> >>> I started coding on a laptop with just one NVIDIA GPU (a GTX 560M, if my memory is right) with Windows 7.
> >> >> >>>
> >> >> >>> But in the project I'm currently working on, we intend to run (py)cuda on a multi-GPU server that has three Tesla C2075 cards.
> >> >> >>>
> >> >> >>> So I installed Ubuntu Server 10.10 (with no GUI) and managed to install and get running the very same examples I ran on the single-GPU laptop.
> >> >>> However, they run really slowly; in some cases they take 3 times longer than on the laptop. And this happens with most, if not all, of the examples I wrote.
> >> >>>
> >> >>> I thought it could be a driver issue, but I double-checked and I've installed the correct ones, meaning those listed in the CUDA Zone section of nvidia.com for Linux 64-bit. So I'm kind of lost right now and was wondering if anyone has had this or a somewhat similar problem running on a server.
> >> >>>
> >> >>> Sorry for the English, but it's not my native language.
> >> >>>
> >> >>> Thanks in advance, Leandro Demarco
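On the SCHED_BLOCKING_SYNC question raised above: in PyCUDA the context scheduling flags are exposed through pycuda.driver.ctx_flags and can be passed to Device.make_context(). A rough sketch of what that might look like, assuming the flag is available in the PyCUDA build in use (not verified against the exact versions in this thread):

    import pycuda.driver as cuda

    cuda.init()
    dev = cuda.Device(0)
    # SCHED_BLOCKING_SYNC tells the driver to block the host thread on a
    # synchronization primitive while waiting for the GPU, instead of
    # spin-waiting; with one host thread per GPU this keeps the CPUs free.
    ctx = dev.make_context(flags=cuda.ctx_flags.SCHED_BLOCKING_SYNC)

    # ... per-GPU work goes here ...

    ctx.pop()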
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
