Apostolis, I'm not using X windows, as I did not install any GUI on the server.
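Regarding the SCHED_BLOCKING_SYNC question further down the thread: here is a minimal sketch of how one might pass that flag when creating each device's context, assuming the PyCUDA build exposes pycuda.driver.ctx_flags (the device index and structure below are illustrative, not taken from the thread):

import pycuda.driver as cuda

cuda.init()

# One context per device. SCHED_BLOCKING_SYNC makes a host thread that
# waits on the GPU block on an OS synchronization primitive instead of
# spin-waiting, which helps when several GPU worker threads share CPUs.
dev = cuda.Device(0)
ctx = dev.make_context(flags=cuda.ctx_flags.SCHED_BLOCKING_SYNC)
try:
    pass  # per-device kernels and copies would go here
finally:
    ctx.pop()

Note that the flag only changes how the CPU waits for the GPU to finish; by itself it should not change copy bandwidth.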
On Tue, Jul 31, 2012 at 11:46 AM, Apostolis Glenis <[email protected]> wrote:
> Maybe it has to do with the initialization of the GPU, if another GPU is
> responsible for X windows.
>
>
> 2012/7/31 Leandro Demarco Vedelago <[email protected]>
>>
>> Just to add a concrete and simple example that I guess will clarify my
>> situation. The following code creates two buffers on the host side, one
>> page-locked and the other a common one, then copies to a GPU buffer and
>> measures performance using events for timing. It's really simple indeed;
>> there is no execution on multiple GPUs, but I would expect it to run in
>> more or less the same time on the server using just one of the Teslas.
>> However, it takes less than a second to run on my laptop and nearly 15
>> seconds on the server!!!
>>
>> import pycuda.driver as cuda
>> import pycuda.autoinit
>> import numpy as np
>>
>> def benchmark(up):
>>     """up is a boolean flag. If True, the benchmark copies from host
>>     to device; if False, it copies the other way round.
>>     """
>>
>>     # Buffer size
>>     size = 10*1024*1024
>>
>>     # Host and device buffers, equally shaped. We don't care about
>>     # their contents.
>>     cpu_buff = np.empty(size, np.dtype('u1'))
>>     cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
>>     gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)
>>
>>     # Events for measuring execution time; the first two are for the
>>     # non-pinned buffer, the last two for the pinned (page-locked) one.
>>     startn = cuda.Event()
>>     endn = cuda.Event()
>>     startl = cuda.Event()
>>     endl = cuda.Event()
>>
>>     if up:
>>         startn.record()
>>         cuda.memcpy_htod(gpu_buff, cpu_buff)
>>         endn.record()
>>         endn.synchronize()
>>         t1 = endn.time_since(startn)
>>
>>         startl.record()
>>         cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
>>         endl.record()
>>         endl.synchronize()
>>         t2 = endl.time_since(startl)
>>
>>         print "From host to device benchmark results:\n"
>>         print "Time for copying from normal host mem: %i ms\n" % t1
>>         print "Time for copying from pinned host mem: %i ms\n" % t2
>>
>>         diff = t1 - t2
>>         if diff > 0:
>>             print "Copy from pinned memory was %i ms faster\n" % diff
>>         else:
>>             print "Copy from pinned memory was %i ms slower\n" % -diff
>>     else:
>>         startn.record()
>>         cuda.memcpy_dtoh(cpu_buff, gpu_buff)
>>         endn.record()
>>         endn.synchronize()
>>         t1 = endn.time_since(startn)
>>
>>         startl.record()
>>         cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
>>         endl.record()
>>         endl.synchronize()
>>         t2 = endl.time_since(startl)
>>
>>         print "From device to host benchmark results:\n"
>>         print "Time for copying to normal host mem: %i ms\n" % t1
>>         print "Time for copying to pinned host mem: %i ms\n" % t2
>>
>>         diff = t1 - t2
>>         if diff > 0:
>>             print "Copy to pinned memory was %i ms faster\n" % diff
>>         else:
>>             print "Copy to pinned memory was %i ms slower\n" % -diff
>>
>> benchmark(up=False)
>>
>>
>> On Mon, Jul 30, 2012 at 3:22 PM, Leandro Demarco Vedelago
>> <[email protected]> wrote:
>> > ---------- Forwarded message ----------
>> > From: Leandro Demarco Vedelago <[email protected]>
>> > Date: Mon, Jul 30, 2012 at 2:57 PM
>> > Subject: Re: [PyCUDA] Performance Issues
>> > To: Brendan Wood <[email protected]>, [email protected]
>> >
>> >
>> > Brendan:
>> > Basically, all the examples compute the dot product of two large
>> > vectors, but each example introduces some new concept (pinned memory,
>> > streams, etc.). The last example is the one that incorporates
>> > multiple GPUs.
>> >
>> > As for the work done, I am generating the data randomly and running
>> > some tests at the end on the host side, which considerably increases
>> > execution time; but as these are "learning examples" I was not
>> > especially worried about that. Still, I would have expected that,
>> > given the server's far more powerful hardware (the three Tesla
>> > C2075s, four Intel Xeons with 6 cores each, and 48 GB of RAM), the
>> > programs would run faster, in particular this last example, which is
>> > designed to work with multiple GPUs.
>> >
>> > I compiled and ran the bandwidthTest and deviceQuery samples from
>> > the SDK and they both passed, if that is what you meant.
>> >
>> > Now answering Andreas: yes, I'm using one thread per GPU (the way
>> > it's done in the wiki example), and yes, the server has far more than
>> > 3 CPUs. As for the SCHED_BLOCKING_SYNC flag, should I pass it as an
>> > argument for each device context? What does this flag do?
>> >
>> > Thank you both for your answers.
>> >
>> > On Mon, Jul 30, 2012 at 12:47 AM, Brendan Wood <[email protected]>
>> > wrote:
>> >> Hi Leandro,
>> >>
>> >> Without knowing exactly what examples you're running, it may be hard
>> >> to say what the problem is. In fact, you may not really have a
>> >> problem.
>> >>
>> >> How much work is being done in each example program? Is it enough to
>> >> really work the GPU, or are communication and other overhead
>> >> dominating the runtime? Note that laptops may have lower
>> >> communication latency over the PCI bus than desktops/servers, which
>> >> can make small programs run much faster on laptops regardless of how
>> >> much processing power the GPU has.
>> >>
>> >> Have you tried running the sample code from the SDK, so that you can
>> >> verify that it's not a code problem?
>> >>
>> >> Regards,
>> >>
>> >> Brendan Wood
>> >>
>> >>
>> >> On Sun, 2012-07-29 at 23:59 -0300, Leandro Demarco Vedelago wrote:
>> >>> Hello: I've been reading and learning CUDA over the last few weeks,
>> >>> and last week I started writing (translating from CUDA C to PyCUDA)
>> >>> some examples taken from the book "CUDA by Example".
>> >>> I started coding on a laptop with just one NVIDIA GPU (a GTX 560M,
>> >>> if I remember correctly) running Windows 7.
>> >>>
>> >>> But in the project I'm currently working on, we intend to run
>> >>> (Py)CUDA on a multi-GPU server that has three Tesla C2075 cards.
>> >>>
>> >>> So I installed Ubuntu Server 10.10 (with no GUI) and managed to
>> >>> install and run the very same examples I ran on the single-GPU
>> >>> laptop. However, they run really slowly; in some cases they take
>> >>> three times longer than on the laptop. And this happens with most,
>> >>> if not all, of the examples I wrote.
>> >>>
>> >>> I thought it could be a driver issue, but I double-checked and I've
>> >>> installed the correct drivers, meaning those listed in the CUDA
>> >>> Zone section of nvidia.com for 64-bit Linux. So I'm kind of lost
>> >>> right now and was wondering whether anyone has had this or a
>> >>> similar problem running on a server.
>> >>>
>> >>> Sorry for the English; it's not my native language.
>> >>>
>> >>> Thanks in advance, Leandro Demarco

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
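A side note on the 15-second benchmark quoted above: on a headless server, the very first CUDA operation in a process can absorb one-time context and driver initialization, so it may be worth timing a copy only after an untimed warm-up pass. A minimal sketch along those lines, reusing the buffer setup from the quoted benchmark (this code is an assumption-laden illustration, not from the original thread):

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

size = 10*1024*1024
cpu_buff = np.empty(size, np.dtype('u1'))
gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)

# Untimed warm-up: memcpy_htod blocks until the copy finishes, so any
# one-time initialization cost is paid here rather than in the timed run.
cuda.memcpy_htod(gpu_buff, cpu_buff)

start = cuda.Event()
end = cuda.Event()
start.record()
cuda.memcpy_htod(gpu_buff, cpu_buff)
end.record()
end.synchronize()
print "Warm copy took %.2f ms" % end.time_since(start)

If the warm run is fast and only the first run is slow, the problem is initialization overhead rather than transfer bandwidth.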
