Apostolis, I'm not using X windows, as I did not install any GUI on the server

On Tue, Jul 31, 2012 at 11:46 AM, Apostolis Glenis
<[email protected]> wrote:
> Maybe it has to do with the initialization of the GPU, if another GPU is
> responsible for X windows.
>
>
> 2012/7/31 Leandro Demarco Vedelago <[email protected]>
>>
>> Just to add a concrete and simple example that I guess will clarify my
>> situation: the following code creates two buffers on the host side, one
>> page-locked and the other a regular one, then copies each to a GPU
>> buffer and measures the transfer time using events.
>> It's really simple indeed; there's no execution on multiple GPUs, but
>> I would expect it to run in more or less the same time on the server
>> using just one of the Teslas.
>> However, it takes less than a second to run on my laptop and nearly 15
>> seconds on the server!
>>
>> import pycuda.driver as cuda
>> import pycuda.autoinit
>> import numpy as np
>>
>> def benchmark(up):
>>         """up is a boolean flag. If True, the benchmark copies from
>>         host to device; if False, it copies the other way round.
>>         """
>>
>>         # Buffers size
>>         size = 10*1024*1024
>>
>>         # Host and device buffers, equally sized. We don't care about
>>         # their contents.
>>         cpu_buff = np.empty(size, np.dtype('u1'))
>>         cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
>>         gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)
>>
>>         # Events for measuring execution time; the first two are for the
>>         # non-pinned buffer, the last two for the pinned (page-locked) one
>>         startn = cuda.Event()
>>         endn = cuda.Event()
>>         startl = cuda.Event()
>>         endl = cuda.Event()
>>
>>         if (up):
>>                 startn.record()
>>                 cuda.memcpy_htod(gpu_buff, cpu_buff)
>>                 endn.record()
>>                 endn.synchronize()
>>                 t1 = endn.time_since(startn)
>>
>>                 startl.record()
>>                 cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
>>                 endl.record()
>>                 endl.synchronize()
>>                 t2 = endl.time_since(startl)
>>
>>                 print "From host to device benchmark results:\n"
>>                 print "Time for copying from normal host mem: %i ms\n" % t1
>>                 print "Time for copying from pinned host mem: %i ms\n" % t2
>>
>>                 diff = t1-t2
>>                 if (diff > 0):
>>                         print "Copy from pinned memory was %i ms faster\n" % diff
>>                 else:
>>                         print "Copy from pinned memory was %i ms slower\n" % -diff
>>
>>         else:
>>                 startn.record()
>>                 cuda.memcpy_dtoh(cpu_buff, gpu_buff)
>>                 endn.record()
>>                 endn.synchronize()
>>                 t1 = endn.time_since(startn)
>>
>>                 startl.record()
>>                 cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
>>                 endl.record()
>>                 endl.synchronize()
>>                 t2 = endl.time_since(startl)
>>
>>                 print "From device to host benchmark results: \n"
>>                 print "Time for copying to normal host mem: %i ms\n" % t1
>>                 print "Time for copying to pinned host mem: %i ms\n" % t2
>>
>>                 diff = t1-t2
>>                 if (diff > 0):
>>                         print "Copy to pinned memory was %i ms faster\n" % diff
>>                 else:
>>                         print "Copy to pinned memory was %i ms slower\n" % -diff
>>
>> benchmark(up=False)
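To make the event timings above easier to compare across machines, the elapsed milliseconds can be converted into effective bandwidth. A minimal sketch in plain Python (independent of PyCUDA; the helper name is my own):

```python
def bandwidth_gb_s(nbytes, ms):
    # Effective bandwidth: bytes transferred divided by elapsed time,
    # expressed in GB/s (1 GB = 1e9 bytes, 1 s = 1e3 ms).
    return (nbytes / 1e9) / (ms / 1e3)

# A 10 MiB transfer that took 2 ms moved data at 5.24288 GB/s:
print(bandwidth_gb_s(10 * 1024 * 1024, 2.0))
```

Pinned and pageable copies of the same size can then be compared as a ratio rather than a raw millisecond difference, which is less sensitive to buffer size.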
>>
>>
>> On Mon, Jul 30, 2012 at 3:22 PM, Leandro Demarco Vedelago
>> <[email protected]> wrote:
>> > ---------- Forwarded message ----------
>> > From: Leandro Demarco Vedelago <[email protected]>
>> > Date: Mon, Jul 30, 2012 at 2:57 PM
>> > Subject: Re: [PyCUDA] Performance Issues
>> > To: Brendan Wood <[email protected]>, [email protected]
>> >
>> >
>> > Brendan:
>> > Basically, all the examples compute the dot product of two large
>> > vectors, but each example introduces some new concept (pinned
>> > memory, streams, etc.).
>> > The last example is the one that incorporates multiple GPUs.
>> >
>> > As for the work done, I am generating the data randomly and making
>> > some tests at the end on the host side, which considerably increases
>> > execution time; but as these are "learning examples" I was not
>> > especially worried about that. I would still have expected that, given
>> > that the server has far more powerful hardware (three Tesla C2075s,
>> > four Intel Xeons with 6 cores each, and 48 GB of RAM), the programs
>> > would run faster, in particular this last example, which is designed
>> > to work with multiple GPUs.
>> >
>> > I compiled and ran the bandwidthTest and deviceQuery samples from
>> > the SDK and they both passed, if that is what you meant.
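For anyone else following along: the SDK samples in question are usually named deviceQuery and bandwidthTest. The path below is a guess based on the 4.x SDK layout and will vary by install location and version:

```shell
# Adjust the path to wherever the SDK samples were built:
cd ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
./deviceQuery      # lists each detected GPU and its properties
./bandwidthTest    # measures pageable vs. pinned host<->device bandwidth
```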
>> >
>> > Now answering Andreas:
>> > yes, I'm using one thread per GPU (the way it's done in the wiki
>> > example) and yes, the server has far more than 3 CPUs. As for the
>> > SCHED_BLOCKING_SYNC flag, should I pass it as an argument when
>> > creating each device context? And what does this flag do?
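For reference, a sketch of how that scheduling flag might be passed when creating one context per GPU. This is untested here, assumes a CUDA-capable machine, and assumes the constant lives in pycuda.driver.ctx_flags, as in the PyCUDA versions I've seen:

```python
import pycuda.driver as cuda

cuda.init()
contexts = []
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    # SCHED_BLOCKING_SYNC asks the driver to block the CPU thread on a
    # synchronization primitive instead of spin-waiting, which frees the
    # core for other work while the GPU runs. The default (SCHED_AUTO)
    # may spin, which can look like high CPU load during long copies.
    contexts.append(dev.make_context(flags=cuda.ctx_flags.SCHED_BLOCKING_SYNC))
```

Whether it helps here depends on whether the slowdown is CPU spinning or something else entirely, so treat it as one thing to try, not a diagnosis.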
>> >
>> > Thank you both for your answers
>> >
>> > On Mon, Jul 30, 2012 at 12:47 AM, Brendan Wood <[email protected]>
>> > wrote:
>> >> Hi Leandro,
>> >>
>> >> Without knowing exactly what examples you're running, it may be hard to
>> >> say what the problem is.  In fact, you may not really have a problem.
>> >>
>> >> How much work is being done in each example program?  Is it enough to
>> >> really work the GPU, or is communication and other overhead dominating
>> >> runtime?  Note that laptops may have lower communication latency over
>> >> the PCI bus than desktops/servers, which can make small programs run
>> >> much faster on laptops regardless of how much processing power the GPU
>> >> has.
>> >>
>> >> Have you tried running the sample code from the SDK, so that you can
>> >> verify that it's not a code problem?
>> >>
>> >> Regards,
>> >>
>> >> Brendan Wood
>> >>
>> >>
>> >> On Sun, 2012-07-29 at 23:59 -0300, Leandro Demarco Vedelago wrote:
>> >>> Hello: I've been reading and learning CUDA over the last few weeks,
>> >>> and last week I started writing (translating from CUDA C to PyCUDA)
>> >>> some examples taken from the book "CUDA by Example".
>> >>> I started coding on a laptop with just one NVIDIA GPU (a GTX 560M,
>> >>> if my memory serves) running Windows 7.
>> >>>
>> >>> But in the project I'm currently working at, we intend to run (py)cuda
>> >>> on a multi-gpu server that has three Tesla C2075 cards.
>> >>>
>> >>> So I installed Ubuntu Server 10.10 (with no GUI) and managed to
>> >>> install and run the very same examples I had run on the single-GPU
>> >>> laptop. However, they run really slowly; in some cases they take 3
>> >>> times longer than on the laptop. And this happens with most, if not
>> >>> all, of the examples I wrote.
>> >>>
>> >>> I thought it could be a driver issue, but I double-checked and I've
>> >>> installed the correct drivers, meaning those listed in the CUDA Zone
>> >>> section of nvidia.com for 64-bit Linux. So I'm kind of lost right now
>> >>> and was wondering if anyone has had this or a similar problem
>> >>> running on a server.
>> >>>
>> >>> Sorry for the English, but it's not my native language.
>> >>>
>> >>> Thanks in advance, Leandro Demarco
>> >>>
>> >>> _______________________________________________
>> >>> PyCUDA mailing list
>> >>> [email protected]
>> >>> http://lists.tiker.net/listinfo/pycuda
>> >>
>> >>
>>
>
>

