If you really want a simple benchmark for speed comparison, I recommend a
matrix-multiplication example.
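
Something along these lines, for example. This is only a sketch with a naive
kernel and matrix sizes I picked myself (nothing from this thread), but
sweeping N shows where the GPU, transfers included, starts to win over NumPy:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
from time import time

# Naive O(N^3) single-precision matrix multiply, one thread per output element.
matmul = SourceModule("""
__global__ void matmul(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += a[row * n + k] * b[k * n + col];
        c[row * n + col] = acc;
    }
}
""").get_function("matmul")

for n in (64, 256, 1024, 2048):
    a = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n, n).astype(np.float32)
    c = np.empty_like(a)

    t0 = time()
    c_ref = a.dot(b)                     # CPU reference
    t_cpu = time() - t0

    t0 = time()
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    c_gpu = cuda.mem_alloc(c.nbytes)
    cuda.memcpy_htod(a_gpu, a)           # host -> device
    cuda.memcpy_htod(b_gpu, b)
    grid = ((n + 15) // 16, (n + 15) // 16)
    matmul(a_gpu, b_gpu, c_gpu, np.int32(n), block=(16, 16, 1), grid=grid)
    cuda.memcpy_dtoh(c, c_gpu)           # device -> host (waits for the kernel)
    t_gpu = time() - t0

    assert np.allclose(c, c_ref, rtol=1e-3, atol=1e-3)
    print("N=%5d  CPU %.4fs  GPU (incl. transfers) %.4fs" % (n, t_cpu, t_gpu))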

What you will really see when comparing the runtime of CUDA kernels to the
runtime of equivalent CPU functions is the cost of transferring your data
from CPU memory to GPU memory and back.

For small datasets with little computation, you will see that the decrease
in compute time from using CUDA is not enough to offset the overhead of the
memory transfer. With larger datasets that require intense computation on
each piece of data, however, the decrease in compute time greatly outweighs
the overhead of the memory transfer.

Another interesting benchmark is to break the total GPU runtime down into
the time to copy data from CPU memory to GPU memory, the time for the GPU
computation, and the time to copy the results from GPU memory back to CPU
memory. I haven't tried this with the latest Kepler cards, but historically
what you will see is a rather large fixed cost for the memory transfers.
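
Here is a rough sketch of how that breakdown can be measured with CUDA
events; the elementwise "square" kernel and the array size are placeholders
of my own, not anything from the original benchmark:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Trivial elementwise kernel, just so there is something to time.
square = SourceModule("""
__global__ void square(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}
""").get_function("square")

n = 32 * 1024 * 1024
x = np.random.randn(n).astype(np.float32)
y = np.empty_like(x)
x_gpu = cuda.mem_alloc(x.nbytes)
y_gpu = cuda.mem_alloc(y.nbytes)

ev = [cuda.Event() for _ in range(4)]

ev[0].record()
cuda.memcpy_htod(x_gpu, x)                       # host -> device copy
ev[1].record()
square(y_gpu, x_gpu, np.int32(n),
       block=(256, 1, 1), grid=((n + 255) // 256, 1))   # GPU computation
ev[2].record()
cuda.memcpy_dtoh(y, y_gpu)                       # device -> host copy
ev[3].record()
ev[3].synchronize()

print("copy in : %6.2f ms" % ev[0].time_till(ev[1]))
print("compute : %6.2f ms" % ev[1].time_till(ev[2]))
print("copy out: %6.2f ms" % ev[2].time_till(ev[3]))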

Many of the programs that see the greatest speed improvement not only make
use of the GPU for computation, but also account for the memory transfer
cost and do something clever to compensate for it. The largest speedups are
also achieved by making use of the special caches/memory types found on the
card.
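
One such trick (my own example, not something from this thread) is to stage
the work through page-locked host buffers and issue the copies and kernels
asynchronously on several streams, so the transfers for one chunk can
overlap with computation on another:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# In-place elementwise kernel used on each chunk.
square = SourceModule("""
__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}
""").get_function("square")

n_chunks, chunk = 4, 1 << 20
streams = [cuda.Stream() for _ in range(n_chunks)]

# Page-locked (pinned) host memory is required for truly asynchronous copies.
host = [cuda.pagelocked_empty(chunk, np.float32) for _ in range(n_chunks)]
dev = [cuda.mem_alloc(chunk * 4) for _ in range(n_chunks)]
for h in host:
    h[:] = np.random.randn(chunk).astype(np.float32)

for h, d, s in zip(host, dev, streams):
    cuda.memcpy_htod_async(d, h, stream=s)       # upload this chunk
    square(d, np.int32(chunk), block=(256, 1, 1),
           grid=((chunk + 255) // 256, 1), stream=s)   # compute on it
    cuda.memcpy_dtoh_async(h, d, stream=s)       # download the result
cuda.Context.synchronize()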

In short, your new Kepler hardware is much, much faster than you think, and
the best results are achieved when the hardware architecture is fully
utilized by the application.

Regards,
Max



On Wed, Jun 12, 2013 at 1:24 PM, Andreas Kloeckner
<[email protected]> wrote:

> Pierre Castellani <[email protected]> writes:
> > I have bought kepler GPU in order to do some numerical calculation on it.
> >
> > I would like to use pyCuda (looks to me the best solution).
> >
> > Unfortunately when I am running a test like MeasureGpuarraySpeedRandom
> > <http://wiki.tiker.net/PyCuda/Examples/MeasureGpuarraySpeedRandom?action=fullsearch&value=linkto%3A%22PyCuda%2FExamples%2FMeasureGpuarraySpeedRandom%22&context=180>
> >
> > I get the following results:
> > Size      |Time GPU        |Size/Time GPU |Time CPU          |Size/Time CPU |GPU vs CPU speedup
> > ----------+----------------+--------------+------------------+--------------+------------------
> > 1024      |0.0719905126953 |14224.0965047 |3.09289598465e-05 |33108129.2446 |0.000429625497701
> > 2048      |0.0727789160156 |28140.0179079 |5.74035215378e-05 |35677253.6795 |0.000788738341822
> > 4096      |0.07278515625   |56275.2106478 |0.00010898976326  |37581511.1208 |0.00149741745261
> > 8192      |0.0722379931641 |113402.928863 |0.000164551048279 |49783942.9508 |0.00227790171171
> > 16384     |0.0720771630859 |227311.94318  |0.000254381122589 |64407294.9802 |0.00352928877467
> > 32768     |0.0722085107422 |453796.923149 |0.00044281665802  |73999022.8609 |0.0061324718301
> > 65536     |0.0720480078125 |909615.713047 |0.000749320983887 |87460516.133  |0.0104003012247
> > 131072    |0.0723209472656 |1812365.64171 |0.00153271682739  |85516122.5202 |0.0211932626071
> > 262144    |0.0727287304688 |3604407.75345 |0.00305026916504  |85941268.0706 |0.041940360369
> > 524288    |0.0723101269531 |7250547.35888 |0.00601688781738  |87136076.9741 |0.0832094766101
> > 1048576   |0.0627352734375 |16714297.1178 |0.0123564978027   |84860291.0582 |0.196962524042
> > 2097152   |0.0743136047363 |28220297.0431 |0.026837512207    |78142563.4322 |0.361138613882
> > 4194304   |0.074144744873  |56569133.8905 |0.0583531860352   |71877891.9367 |0.787017153206
> > 8388608   |0.0736544189453 |113891442.226 |0.121150952148    |69240958.0877 |1.64485653248
> > 16777216  |0.0743454406738 |225665701.191 |0.242345166016    |69228597.6891 |3.2597179305
> > 33554432  |0.0765948486328 |438076875.912 |0.484589794922    |69242960.4412 |6.32666300112
> > 67108864  |0.0805058410645 |833589999.343 |0.970654882812    |69137718.45   |12.0569497813
> > 134217728 |0.0846059753418 |1586385919.64 |1.94103554688     |69147485.8439 |22.9420621774
> > 268435456 |0.094531427002  |2839642482.01 |3.88270039062     |69136278.6189 |41.0731173089
> > 536870912 |0.111502416992  |4814881385.37 |7.7108625         |69625273.6967 |69.1542184286
> >
> >
> > I was not expecting fantastic results, but not ones that bad either.
>
> I've added a note to the documentation of the function you're using to
> benchmark:
>
> http://documen.tician.de/pycuda/array.html#pycuda.curandom.rand
>
> That should answer your concerns.
>
> I'd like to have a word with whoever came up with the idea that this was
> a valid benchmark. Random number generation is a bad problem to
> use. Parallel RNGs are more complicated than sequential ones. So
> claiming that both do the same amount of work is... mistaken. But even
> neglecting this basic fact, the notion that all RNGs are somehow
> comparable or do comparable amounts of work is also completely
> off. There are subtle tradeoffs in how much work is done and how 'good'
> (uncorrelated, ...) the RN sequence and its subsequences are:
>
> https://www.xkcd.com/221/
>
> If you'd like to assess how viable GPUs and PyCUDA are, I'd suggest you
> use a more well-defined workload, such as "compute 10^8 sines and
> cosines", or, even better, the thing that you'd actually like to do.
>
> Andreas
>
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
>



-- 
Respectfully,
Massimo 'Max' J. Becker
Computer Scientist / Software Engineer
Commercial Pilot - SEL/MEL
(425)-239-1710
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
