On 12/06/2013 22:40, Massimo Becker wrote:
If you really want a simple benchmark for speed comparison, I recommend
a matrix multiplication example.
The thing you will really see when comparing the runtime of CUDA
kernels to the runtime of equivalent CPU functions is the cost of
transferring your data from CPU memory to GPU memory and back.
For small datasets with little computation, the decrease in compute
time from using CUDA is not enough to offset the overhead of the
memory transfer, while for larger datasets that require intense
computation on each piece of data, the decrease in compute time
greatly outweighs that overhead.
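To illustrate, here is a minimal, untested sketch of such a benchmark in
PyCUDA: a deliberately naive one-thread-per-output-element multiply
kernel, with the GPU timed end to end so both transfers are included.
The matrix size n and the 16x16 block size are arbitrary choices, and n
is assumed divisible by 16:

import time

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Deliberately naive kernel: one thread per output element.
mod = SourceModule("""
__global__ void matmul(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
    {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += a[row*n + k] * b[k*n + col];
        c[row*n + col] = acc;
    }
}
""")
matmul = mod.get_function("matmul")

n = 1024  # assumed divisible by the 16x16 block size
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)
c = np.empty_like(a)

# Time the GPU end to end, *including* both transfers, so the
# comparison against the CPU is fair.
start, end = drv.Event(), drv.Event()
start.record()
matmul(drv.In(a), drv.In(b), drv.Out(c), np.int32(n),
       block=(16, 16, 1), grid=(n // 16, n // 16))
end.record()
end.synchronize()
print("GPU, transfers included: %.1f ms" % start.time_till(end))

t0 = time.time()
c_ref = np.dot(a, b)
print("CPU (numpy dot):         %.1f ms" % ((time.time() - t0) * 1e3))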
Another interesting benchmark is to look at the runtime of the CUDA
kernel broken down into time to copy data from CPU memory to GPU
memory, time for GPU computation, and time to copy data from GPU
memory back to CPU memory. I haven't tried this with the latest Kepler
cards, but historically what you will see is a rather large fixed cost
of doing the memory transfers.
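With PyCUDA you can get that breakdown using CUDA events; a minimal
sketch along these lines (the array size and the trivial elementwise
computation are just placeholders):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

x = np.random.randn(16 * 1024 * 1024).astype(np.float32)
ev = [drv.Event() for _ in range(4)]

ev[0].record()
x_gpu = gpuarray.to_gpu(x)   # CPU memory -> GPU memory
ev[1].record()
y_gpu = 2 * x_gpu + 1        # GPU computation (trivial on purpose)
ev[2].record()
y = y_gpu.get()              # GPU memory -> CPU memory
ev[3].record()
ev[3].synchronize()

for label, i in [("copy in ", 0), ("compute ", 1), ("copy out", 2)]:
    print("%s: %6.2f ms" % (label, ev[i].time_till(ev[i + 1])))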
Many of the programs that see the greatest speed improvement not only
make use of the GPU for computation, but also acknowledge the memory
transfer cost and do something clever (such as overlapping transfers
with computation) to compensate for it, as sketched below. The fastest
speedups are also achieved by making use of the special caches/memory
types found on the card.
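As a rough sketch of the overlap idea (not a complete pipeline; the
buffer size is a placeholder), streams plus page-locked host memory
look like this in PyCUDA:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

stream = drv.Stream()

# Page-locked (pinned) host memory is required for the copy to be
# truly asynchronous.
a = drv.pagelocked_empty(1024 * 1024, np.float32)
a[:] = np.random.randn(a.size)

a_gpu = gpuarray.empty(a.shape, a.dtype)
drv.memcpy_htod_async(a_gpu.gpudata, a, stream)
# Kernels launched on `stream` here run after the copy finishes, but
# the host (and other streams) are free to do work in the meantime.
stream.synchronize()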
In short: your new Kepler hardware is much faster than you think, and
the best results are achieved when the hardware architecture is fully
utilized by the application.
Regards,
Max
On Wed, Jun 12, 2013 at 1:24 PM, Andreas Kloeckner <[email protected]> wrote:

Pierre Castellani <[email protected]> writes:
> I have bought a Kepler GPU in order to do some numerical calculations on it.
>
> I would like to use PyCUDA (it looks to me like the best solution).
>
> Unfortunately, when I am running a test like MeasureGpuarraySpeedRandom
> <http://wiki.tiker.net/PyCuda/Examples/MeasureGpuarraySpeedRandom>
>
> I get the following results:
> Size      | Time GPU [s]    | Size/Time GPU | Time CPU [s]      | Size/Time CPU | GPU vs CPU speedup
> ----------+-----------------+---------------+-------------------+---------------+-------------------
> 1024      | 0.0719905126953 | 14224.0965047 | 3.09289598465e-05 | 33108129.2446 | 0.000429625497701
> 2048      | 0.0727789160156 | 28140.0179079 | 5.74035215378e-05 | 35677253.6795 | 0.000788738341822
> 4096      | 0.07278515625   | 56275.2106478 | 0.00010898976326  | 37581511.1208 | 0.00149741745261
> 8192      | 0.0722379931641 | 113402.928863 | 0.000164551048279 | 49783942.9508 | 0.00227790171171
> 16384     | 0.0720771630859 | 227311.94318  | 0.000254381122589 | 64407294.9802 | 0.00352928877467
> 32768     | 0.0722085107422 | 453796.923149 | 0.00044281665802  | 73999022.8609 | 0.0061324718301
> 65536     | 0.0720480078125 | 909615.713047 | 0.000749320983887 | 87460516.133  | 0.0104003012247
> 131072    | 0.0723209472656 | 1812365.64171 | 0.00153271682739  | 85516122.5202 | 0.0211932626071
> 262144    | 0.0727287304688 | 3604407.75345 | 0.00305026916504  | 85941268.0706 | 0.041940360369
> 524288    | 0.0723101269531 | 7250547.35888 | 0.00601688781738  | 87136076.9741 | 0.0832094766101
> 1048576   | 0.0627352734375 | 16714297.1178 | 0.0123564978027   | 84860291.0582 | 0.196962524042
> 2097152   | 0.0743136047363 | 28220297.0431 | 0.026837512207    | 78142563.4322 | 0.361138613882
> 4194304   | 0.074144744873  | 56569133.8905 | 0.0583531860352   | 71877891.9367 | 0.787017153206
> 8388608   | 0.0736544189453 | 113891442.226 | 0.121150952148    | 69240958.0877 | 1.64485653248
> 16777216  | 0.0743454406738 | 225665701.191 | 0.242345166016    | 69228597.6891 | 3.2597179305
> 33554432  | 0.0765948486328 | 438076875.912 | 0.484589794922    | 69242960.4412 | 6.32666300112
> 67108864  | 0.0805058410645 | 833589999.343 | 0.970654882812    | 69137718.45   | 12.0569497813
> 134217728 | 0.0846059753418 | 1586385919.64 | 1.94103554688     | 69147485.8439 | 22.9420621774
> 268435456 | 0.094531427002  | 2839642482.01 | 3.88270039062     | 69136278.6189 | 41.0731173089
> 536870912 | 0.111502416992  | 4814881385.37 | 7.7108625         | 69625273.6967 | 69.1542184286
>
>
> I was not expecting fantastic results, but not results this bad.
I've added a note to the documentation of the function you're
using to benchmark:
http://documen.tician.de/pycuda/array.html#pycuda.curandom.rand
That should answer your concerns.
I'd like to have a word with whoever came up with the idea that this
was a valid benchmark. Random number generation is a bad problem to
use. Parallel RNGs are more complicated than sequential ones, so
claiming that both do the same amount of work is... mistaken. But even
neglecting this basic fact, the notion that all RNGs are somehow
comparable or do comparable amounts of work is also completely off.
There are subtle tradeoffs in how much work is done and how 'good'
(uncorrelated, ...) the RN sequence and its subsequences are:
https://www.xkcd.com/221/
If you'd like to assess how viable GPUs and PyCUDA are, I'd suggest
you use a more well-defined workload, such as "compute 10^8 sines and
cosines", or, even better, the thing that you'd actually like to do.
Andreas
--
Respectfully,
Massimo 'Max' J. Becker
Computer Scientist / Software Engineer
Commercial Pilot - SEL/MEL
(425)-239-1710
Thanks for all the advice and answers.
I will look more closely at the actual computation I want to run in
order to evaluate the performance gain.
Thanks again,
Pierre.
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda