On 12/06/2013 22:40, Massimo Becker wrote:
If you really want a simple benchmark for speed comparison, I recommend a matrix-multiplication example.
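
For instance, a minimal sketch of such a benchmark with PyCUDA might look like the following. The kernel is deliberately naive, and the matrix size N is an assumption you would vary to find the crossover point:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    N = 1024  # matrix dimension; a multiple of the 16x16 block used below

    mod = SourceModule("""
    __global__ void matmul(const float *a, const float *b, float *c, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }
    """)
    matmul = mod.get_function("matmul")

    a = np.random.randn(N, N).astype(np.float32)
    b = np.random.randn(N, N).astype(np.float32)
    c = np.empty_like(a)

    # cuda.In/cuda.Out wrap the host<->device copies around the launch.
    matmul(cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(N),
           block=(16, 16, 1), grid=(N // 16, N // 16))

    # Check against the CPU result so the comparison stays honest.
    print(np.allclose(c, np.dot(a, b), atol=1e-2))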

What you will really see when comparing the runtime of CUDA kernels to that of equivalent CPU functions is the cost of transferring your data from CPU memory to GPU memory and back.

For small datasets with little computation, the decrease in compute time from using CUDA is not enough to offset the overhead of the memory transfer, while for larger datasets that require intense computation on each piece of data, the decrease in compute time greatly outweighs that overhead.

Another interesting benchmark is to break the runtime of a CUDA kernel down into the time to copy data from CPU memory to GPU memory, the time for the GPU computation, and the time to copy the results from GPU memory back to CPU memory. I haven't tried this with the latest Kepler cards, but historically what you will see is a rather large fixed cost for the memory transfers.
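
A sketch of that breakdown using CUDA events follows; the double_it kernel is just a stand-in workload made up for illustration:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_it(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }
    """)
    double_it = mod.get_function("double_it")

    n = 1 << 22
    host = np.random.randn(n).astype(np.float32)
    dev = cuda.mem_alloc(host.nbytes)

    # One event per phase boundary; events are timed on the GPU itself.
    start, after_h2d, after_kernel, after_d2h = \
        [cuda.Event() for _ in range(4)]

    start.record()
    cuda.memcpy_htod(dev, host)                     # CPU -> GPU
    after_h2d.record()
    double_it(dev, np.int32(n),
              block=(256, 1, 1), grid=((n + 255) // 256, 1))
    after_kernel.record()
    cuda.memcpy_dtoh(host, dev)                     # GPU -> CPU
    after_d2h.record()
    after_d2h.synchronize()

    print("H->D:   %.3f ms" % after_h2d.time_since(start))
    print("kernel: %.3f ms" % after_kernel.time_since(after_h2d))
    print("D->H:   %.3f ms" % after_d2h.time_since(after_kernel))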

Many of the programs that see the greatest speed improvement not only use the GPU for computation, but also account for the memory transfer cost and do something clever to compensate for it. The largest speedups additionally come from making use of the special caches and memory types found on the card.
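
One such trick, sketched here on the assumption that you can stage your input in pinned (page-locked) host memory: asynchronous copies on a stream can overlap with compute happening elsewhere.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda

    stream = cuda.Stream()
    # Pinned memory is required for truly asynchronous transfers.
    host = cuda.pagelocked_empty(1 << 22, dtype=np.float32)
    dev = cuda.mem_alloc(host.nbytes)

    # Returns immediately; the copy proceeds on `stream` while the CPU
    # (or kernels on another stream) keep working.
    cuda.memcpy_htod_async(dev, host, stream)
    # ... launch kernels on `stream` here ...
    stream.synchronize()

The special on-card memories (shared, constant, texture) are programmed inside the kernel source itself, e.g. with the __shared__ qualifier, so they don't show up in this host-side sketch.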

In short, your new Kepler hardware is much faster than you think, and the best results come when the application fully exploits the hardware architecture.

Regards,
Max



On Wed, Jun 12, 2013 at 1:24 PM, Andreas Kloeckner <[email protected]> wrote:

    Pierre Castellani <[email protected]> writes:
    > I have bought a Kepler GPU in order to do some numerical
    > calculations on it.
    >
    > I would like to use PyCUDA (it looks like the best solution to me).
    >
    > Unfortunately, when I am running a test like
    > MeasureGpuarraySpeedRandom
    > <http://wiki.tiker.net/PyCuda/Examples/MeasureGpuarraySpeedRandom?action=fullsearch&value=linkto%3A%22PyCuda%2FExamples%2FMeasureGpuarraySpeedRandom%22&context=180>
    > I get the following results:
    > Size      | Time GPU        | Size/Time GPU | Time CPU          | Size/Time CPU | GPU vs CPU speedup
    > ----------+-----------------+---------------+-------------------+---------------+-------------------
    > 1024      | 0.0719905126953 | 14224.0965047 | 3.09289598465e-05 | 33108129.2446 | 0.000429625497701
    > 2048      | 0.0727789160156 | 28140.0179079 | 5.74035215378e-05 | 35677253.6795 | 0.000788738341822
    > 4096      | 0.07278515625   | 56275.2106478 | 0.00010898976326  | 37581511.1208 | 0.00149741745261
    > 8192      | 0.0722379931641 | 113402.928863 | 0.000164551048279 | 49783942.9508 | 0.00227790171171
    > 16384     | 0.0720771630859 | 227311.94318  | 0.000254381122589 | 64407294.9802 | 0.00352928877467
    > 32768     | 0.0722085107422 | 453796.923149 | 0.00044281665802  | 73999022.8609 | 0.0061324718301
    > 65536     | 0.0720480078125 | 909615.713047 | 0.000749320983887 | 87460516.133  | 0.0104003012247
    > 131072    | 0.0723209472656 | 1812365.64171 | 0.00153271682739  | 85516122.5202 | 0.0211932626071
    > 262144    | 0.0727287304688 | 3604407.75345 | 0.00305026916504  | 85941268.0706 | 0.041940360369
    > 524288    | 0.0723101269531 | 7250547.35888 | 0.00601688781738  | 87136076.9741 | 0.0832094766101
    > 1048576   | 0.0627352734375 | 16714297.1178 | 0.0123564978027   | 84860291.0582 | 0.196962524042
    > 2097152   | 0.0743136047363 | 28220297.0431 | 0.026837512207    | 78142563.4322 | 0.361138613882
    > 4194304   | 0.074144744873  | 56569133.8905 | 0.0583531860352   | 71877891.9367 | 0.787017153206
    > 8388608   | 0.0736544189453 | 113891442.226 | 0.121150952148    | 69240958.0877 | 1.64485653248
    > 16777216  | 0.0743454406738 | 225665701.191 | 0.242345166016    | 69228597.6891 | 3.2597179305
    > 33554432  | 0.0765948486328 | 438076875.912 | 0.484589794922    | 69242960.4412 | 6.32666300112
    > 67108864  | 0.0805058410645 | 833589999.343 | 0.970654882812    | 69137718.45   | 12.0569497813
    > 134217728 | 0.0846059753418 | 1586385919.64 | 1.94103554688     | 69147485.8439 | 22.9420621774
    > 268435456 | 0.094531427002  | 2839642482.01 | 3.88270039062     | 69136278.6189 | 41.0731173089
    > 536870912 | 0.111502416992  | 4814881385.37 | 7.7108625         | 69625273.6967 | 69.1542184286
    >
    >
    > I was not expecting fantastic results, but not ones this bad.

    I've added a note to the documentation of the function you're
    using to benchmark:

    http://documen.tician.de/pycuda/array.html#pycuda.curandom.rand

    That should answer your concerns.

    I'd like to have a word with whoever came up with the idea that this
    was a valid benchmark. Random number generation is a bad problem to
    use. Parallel RNGs are more complicated than sequential ones, so
    claiming that both do the same amount of work is... mistaken. But even
    neglecting this basic fact, the notion that all RNGs are somehow
    comparable or do comparable amounts of work is also completely off.
    There are subtle tradeoffs in how much work is done and how 'good'
    (uncorrelated, ...) the RN sequence and its subsequences are:

    https://www.xkcd.com/221/

    If you'd like to assess how viable GPUs and PyCUDA are, I'd suggest
    you use a better-defined workload, such as "compute 10^8 sines and
    cosines", or, even better, the thing that you'd actually like to do.

    Andreas





--
Respectfully,
Massimo 'Max' J. Becker
Computer Scientist / Software Engineer
Commercial Pilot - SEL/MEL
(425)-239-1710

Thanks for all the advice and answers.

I will look more closely at the target computation that I should actually run in order to evaluate the performance gain.

Thanks again,
Pierre.
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
