If you really want a simple benchmark for speed comparison, I recommend a matrix multiplication example.
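[Editor's note: as a rough illustration of that kind of benchmark, here is a minimal sketch of a naive matrix-multiplication comparison with PyCUDA. It assumes PyCUDA and NumPy are installed; the 1024x1024 size, the 16x16 block, and the unoptimized kernel are illustrative choices, not a tuned benchmark.]

import time
import numpy as np
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

n = 1024  # n x n single-precision matrices

# Deliberately naive kernel: one output element per thread, no shared-memory
# tiling, so there is plenty of headroom for the optimizations Max mentions.
mod = SourceModule("""
__global__ void matmul(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
    {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += a[row * n + k] * b[k * n + col];
        c[row * n + col] = acc;
    }
}
""")
matmul = mod.get_function("matmul")

a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)

# CPU reference (numpy.dot typically calls an optimized BLAS)
t0 = time.time()
c_cpu = np.dot(a, b)
cpu_time = time.time() - t0

# GPU run, timed end to end so the host<->device transfers are included
t0 = time.time()
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.empty((n, n), np.float32)
block = (16, 16, 1)
grid = ((n + block[0] - 1) // block[0], (n + block[1] - 1) // block[1])
matmul(a_gpu, b_gpu, c_gpu, np.int32(n), block=block, grid=grid)
drv.Context.synchronize()
c = c_gpu.get()
gpu_time = time.time() - t0

print("CPU: %.4f s   GPU incl. transfers: %.4f s   max abs diff: %g"
      % (cpu_time, gpu_time, np.abs(c - c_cpu).max()))

[On many machines the BLAS-backed numpy.dot will still be competitive with this naive kernel at this size; the point of the sketch is to show where transfer and compute time go, not to declare a winner.]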
The thing you will really see when comparing the runtime of CUDA kernels to the runtime of equivalent CPU functions is the cost of transferring your data from CPU memory to GPU memory and back. For small datasets with little computation, the decrease in compute time from using CUDA is not enough to offset the overhead of the memory transfer, while for larger datasets that require intense computation on each piece of data, the decrease in compute time greatly outweighs that overhead.

Another interesting benchmark is to break the runtime of a CUDA kernel down into the time to copy data from CPU memory to GPU memory, the time for the GPU computation, and the time to copy data from GPU memory back to CPU memory. I haven't tried this with the latest Kepler cards, but historically what you will see is a rather large fixed cost for the memory transfers. Many of the programs that see the greatest speed improvement not only make use of the GPU for computation, but also acknowledge the memory transfer cost and do something clever to compensate for it. The fastest speedups are also achieved by making use of the special caches and memory types found on the card (shared memory, constant memory, the texture cache).

In short, your new Kepler hardware is much, much faster than you think, and the best results are achieved when the hardware architecture is fully utilized by the application.

Regards,
Max

On Wed, Jun 12, 2013 at 1:24 PM, Andreas Kloeckner <[email protected]> wrote:

> Pierre Castellani <[email protected]> writes:
> > I have bought a Kepler GPU in order to do some numerical calculations on it.
> >
> > I would like to use PyCUDA (it looks to me like the best solution).
> >
> > Unfortunately, when I run a test like MeasureGpuarraySpeedRandom
> > <http://wiki.tiker.net/PyCuda/Examples/MeasureGpuarraySpeedRandom?action=fullsearch&value=linkto%3A%22PyCuda%2FExamples%2FMeasureGpuarraySpeedRandom%22&context=180>
> > I get the following results:
> >
> > Size     |Time GPU       |Size/Time GPU|Time CPU         |Size/Time CPU|GPU vs CPU speedup
> > ---------+---------------+-------------+-----------------+-------------+------------------
> > 1024     |0.0719905126953|14224.0965047|3.09289598465e-05|33108129.2446|0.000429625497701
> > 2048     |0.0727789160156|28140.0179079|5.74035215378e-05|35677253.6795|0.000788738341822
> > 4096     |0.07278515625  |56275.2106478|0.00010898976326 |37581511.1208|0.00149741745261
> > 8192     |0.0722379931641|113402.928863|0.000164551048279|49783942.9508|0.00227790171171
> > 16384    |0.0720771630859|227311.94318 |0.000254381122589|64407294.9802|0.00352928877467
> > 32768    |0.0722085107422|453796.923149|0.00044281665802 |73999022.8609|0.0061324718301
> > 65536    |0.0720480078125|909615.713047|0.000749320983887|87460516.133 |0.0104003012247
> > 131072   |0.0723209472656|1812365.64171|0.00153271682739 |85516122.5202|0.0211932626071
> > 262144   |0.0727287304688|3604407.75345|0.00305026916504 |85941268.0706|0.041940360369
> > 524288   |0.0723101269531|7250547.35888|0.00601688781738 |87136076.9741|0.0832094766101
> > 1048576  |0.0627352734375|16714297.1178|0.0123564978027  |84860291.0582|0.196962524042
> > 2097152  |0.0743136047363|28220297.0431|0.026837512207   |78142563.4322|0.361138613882
> > 4194304  |0.074144744873 |56569133.8905|0.0583531860352  |71877891.9367|0.787017153206
> > 8388608  |0.0736544189453|113891442.226|0.121150952148   |69240958.0877|1.64485653248
> > 16777216 |0.0743454406738|225665701.191|0.242345166016   |69228597.6891|3.2597179305
> > 33554432 |0.0765948486328|438076875.912|0.484589794922   |69242960.4412|6.32666300112
> > 67108864 |0.0805058410645|833589999.343|0.970654882812   |69137718.45  |12.0569497813
> > 134217728|0.0846059753418|1586385919.64|1.94103554688    |69147485.8439|22.9420621774
> > 268435456|0.094531427002 |2839642482.01|3.88270039062    |69136278.6189|41.0731173089
> > 536870912|0.111502416992 |4814881385.37|7.7108625        |69625273.6967|69.1542184286
> >
> > I was not expecting fantastic results, but not that bad.
>
> I've added a note to the documentation of the function you're using to
> benchmark:
>
> http://documen.tician.de/pycuda/array.html#pycuda.curandom.rand
>
> That should answer your concerns.
>
> I'd like to have a word with whoever came up with the idea that this was
> a valid benchmark. Random number generation is a bad problem to use.
> Parallel RNGs are more complicated than sequential ones, so claiming that
> both do the same amount of work is... mistaken. But even neglecting this
> basic fact, the notion that all RNGs are somehow comparable or do
> comparable amounts of work is also completely off. There are subtle
> tradeoffs in how much work is done and how 'good' (uncorrelated, ...) the
> RN sequence and its subsequences are:
>
> https://www.xkcd.com/221/
>
> If you'd like to assess how viable GPUs and PyCUDA are, I'd suggest you
> use a more well-defined workload, such as "compute 10^8 sines and
> cosines", or, even better, the thing that you'd actually like to do.
>
> Andreas

--
Respectfully,
Massimo 'Max' J. Becker
Computer Scientist / Software Engineer
Commercial Pilot - SEL/MEL
(425)-239-1710
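[Editor's note: to make the two suggestions in this thread concrete (Max's idea of splitting the measurement into copy-in, compute, and copy-out, and Andreas's suggestion of a well-defined workload such as 10^8 sines and cosines), here is a rough sketch using CUDA events. The element count, the use of pycuda.cumath, and the absence of a warm-up pass are illustrative simplifications, not a prescribed benchmark.]

import numpy as np
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

# 10**8 float32 values: ~400 MB per array, ~1.2 GB total on the device here.
# Scale n down if your card has less memory.
n = 10**8
x = np.random.rand(n).astype(np.float32)

def timed(fn):
    # Bracket fn() with CUDA events and return (result, elapsed milliseconds).
    start, end = drv.Event(), drv.Event()
    start.record()
    result = fn()
    end.record()
    end.synchronize()
    return result, start.time_till(end)

# Host -> device copy
x_gpu, t_h2d = timed(lambda: gpuarray.to_gpu(x))

# Device computation (two elementwise kernels; the first call to each
# cumath function also pays a one-time kernel compilation cost)
sin_gpu, t_sin = timed(lambda: cumath.sin(x_gpu))
cos_gpu, t_cos = timed(lambda: cumath.cos(x_gpu))

# Device -> host copy
_, t_d2h = timed(lambda: cos_gpu.get())

print("H->D copy : %8.2f ms" % t_h2d)
print("sin kernel: %8.2f ms" % t_sin)
print("cos kernel: %8.2f ms" % t_cos)
print("D->H copy : %8.2f ms" % t_d2h)

[Breaking the numbers out this way usually makes the fixed transfer cost Max describes visible directly, rather than leaving it folded into a single GPU-vs-CPU ratio.]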
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
