On 12/06/2013 22:40, Massimo Becker wrote:
If you really want a simple benchmark for speed comparison, I recommend a matrix-multiplication example.
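
For instance, a minimal sketch of such a benchmark with PyCUDA might look like the following. The kernel is deliberately naive, and the matrix size N is an assumption you would vary to find the crossover point:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    N = 1024  # matrix dimension; a multiple of the 16x16 block used below

    mod = SourceModule("""
    __global__ void matmul(const float *a, const float *b, float *c, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }
    """)
    matmul = mod.get_function("matmul")

    a = np.random.randn(N, N).astype(np.float32)
    b = np.random.randn(N, N).astype(np.float32)
    c = np.empty_like(a)

    # cuda.In/cuda.Out wrap the host<->device copies around the launch.
    matmul(cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(N),
           block=(16, 16, 1), grid=(N // 16, N // 16))

    # Check against the CPU result so the comparison stays honest.
    print(np.allclose(c, np.dot(a, b), atol=1e-2))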

What you will really see when comparing the runtime of CUDA kernels to that of equivalent CPU functions is the cost of transferring your data from CPU memory to GPU memory and back.

For small datasets with little computation, the decrease in compute time from using CUDA is not enough to offset the overhead of the memory transfer, while for larger datasets that require intense computation on each piece of data, the decrease in compute time greatly outweighs that overhead.

Another interesting benchmark is to break the runtime of a CUDA kernel down into the time to copy data from CPU memory to GPU memory, the time for the GPU computation, and the time to copy the results from GPU memory back to CPU memory. I haven't tried this with the latest Kepler cards, but historically what you will see is a rather large fixed cost for the memory transfers.
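
A sketch of that breakdown using CUDA events follows; the double_it kernel is just a stand-in workload made up for illustration:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_it(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }
    """)
    double_it = mod.get_function("double_it")

    n = 1 << 22
    host = np.random.randn(n).astype(np.float32)
    dev = cuda.mem_alloc(host.nbytes)

    # One event per phase boundary; events are timed on the GPU itself.
    start, after_h2d, after_kernel, after_d2h = \
        [cuda.Event() for _ in range(4)]

    start.record()
    cuda.memcpy_htod(dev, host)                     # CPU -> GPU
    after_h2d.record()
    double_it(dev, np.int32(n),
              block=(256, 1, 1), grid=((n + 255) // 256, 1))
    after_kernel.record()
    cuda.memcpy_dtoh(host, dev)                     # GPU -> CPU
    after_d2h.record()
    after_d2h.synchronize()

    print("H->D:   %.3f ms" % after_h2d.time_since(start))
    print("kernel: %.3f ms" % after_kernel.time_since(after_h2d))
    print("D->H:   %.3f ms" % after_d2h.time_since(after_kernel))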

Many of the programs that see the greatest speed improvement not only use the GPU for computation, but also account for the memory transfer cost and do something clever to compensate for it. The largest speedups additionally come from making use of the special caches and memory types found on the card.
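
One such trick, sketched here on the assumption that you can stage your input in pinned (page-locked) host memory: asynchronous copies on a stream can overlap with compute happening elsewhere.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda

    stream = cuda.Stream()
    # Pinned memory is required for truly asynchronous transfers.
    host = cuda.pagelocked_empty(1 << 22, dtype=np.float32)
    dev = cuda.mem_alloc(host.nbytes)

    # Returns immediately; the copy proceeds on `stream` while the CPU
    # (or kernels on another stream) keep working.
    cuda.memcpy_htod_async(dev, host, stream)
    # ... launch kernels on `stream` here ...
    stream.synchronize()

The special on-card memories (shared, constant, texture) are programmed inside the kernel source itself, e.g. with the __shared__ qualifier, so they don't show up in this host-side sketch.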

In short, your new Kepler hardware is much faster than you think, and the best results come when the application fully exploits the hardware architecture.

Regards,
Max



On Wed, Jun 12, 2013 at 1:24 PM, Andreas Kloeckner <[email protected]> wrote:

    Pierre Castellani <[email protected]> writes:
    > I have bought a Kepler GPU in order to do some numerical
    > calculations on it.
    >
    > I would like to use PyCUDA (it looks like the best solution to me).
    >
    > Unfortunately, when I am running a test like
    > MeasureGpuarraySpeedRandom
    > <http://wiki.tiker.net/PyCuda/Examples/MeasureGpuarraySpeedRandom?action=fullsearch&value=linkto%3A%22PyCuda%2FExamples%2FMeasureGpuarraySpeedRandom%22&context=180>
    > I get the following results:
    > Size      | Time GPU        | Size/Time GPU | Time CPU          | Size/Time CPU | GPU vs CPU speedup
    > ----------+-----------------+---------------+-------------------+---------------+-------------------
    > 1024      | 0.0719905126953 | 14224.0965047 | 3.09289598465e-05 | 33108129.2446 | 0.000429625497701
    > 2048      | 0.0727789160156 | 28140.0179079 | 5.74035215378e-05 | 35677253.6795 | 0.000788738341822
    > 4096      | 0.07278515625   | 56275.2106478 | 0.00010898976326  | 37581511.1208 | 0.00149741745261
    > 8192      | 0.0722379931641 | 113402.928863 | 0.000164551048279 | 49783942.9508 | 0.00227790171171
    > 16384     | 0.0720771630859 | 227311.94318  | 0.000254381122589 | 64407294.9802 | 0.00352928877467
    > 32768     | 0.0722085107422 | 453796.923149 | 0.00044281665802  | 73999022.8609 | 0.0061324718301
    > 65536     | 0.0720480078125 | 909615.713047 | 0.000749320983887 | 87460516.133  | 0.0104003012247
    > 131072    | 0.0723209472656 | 1812365.64171 | 0.00153271682739  | 85516122.5202 | 0.0211932626071
    > 262144    | 0.0727287304688 | 3604407.75345 | 0.00305026916504  | 85941268.0706 | 0.041940360369
    > 524288    | 0.0723101269531 | 7250547.35888 | 0.00601688781738  | 87136076.9741 | 0.0832094766101
    > 1048576   | 0.0627352734375 | 16714297.1178 | 0.0123564978027   | 84860291.0582 | 0.196962524042
    > 2097152   | 0.0743136047363 | 28220297.0431 | 0.026837512207    | 78142563.4322 | 0.361138613882
    > 4194304   | 0.074144744873  | 56569133.8905 | 0.0583531860352   | 71877891.9367 | 0.787017153206
    > 8388608   | 0.0736544189453 | 113891442.226 | 0.121150952148    | 69240958.0877 | 1.64485653248
    > 16777216  | 0.0743454406738 | 225665701.191 | 0.242345166016    | 69228597.6891 | 3.2597179305
    > 33554432  | 0.0765948486328 | 438076875.912 | 0.484589794922    | 69242960.4412 | 6.32666300112
    > 67108864  | 0.0805058410645 | 833589999.343 | 0.970654882812    | 69137718.45   | 12.0569497813
    > 134217728 | 0.0846059753418 | 1586385919.64 | 1.94103554688     | 69147485.8439 | 22.9420621774
    > 268435456 | 0.094531427002  | 2839642482.01 | 3.88270039062     | 69136278.6189 | 41.0731173089
    > 536870912 | 0.111502416992  | 4814881385.37 | 7.7108625         | 69625273.6967 | 69.1542184286
    >
    >
    > I was not expecting fantastic results, but not ones this bad.

    I've added a note to the documentation of the function you're
    using to benchmark:

    http://documen.tician.de/pycuda/array.html#pycuda.curandom.rand

    That should answer your concerns.

    I'd like to have a word with whoever came up with the idea that this
    was a valid benchmark. Random number generation is a bad problem to
    use. Parallel RNGs are more complicated than sequential ones, so
    claiming that both do the same amount of work is... mistaken. But even
    neglecting this basic fact, the notion that all RNGs are somehow
    comparable or do comparable amounts of work is also completely off.
    There are subtle tradeoffs in how much work is done and how 'good'
    (uncorrelated, ...) the RN sequence and its subsequences are:

    https://www.xkcd.com/221/

    If you'd like to assess how viable GPUs and PyCUDA are, I'd suggest
    you use a better-defined workload, such as "compute 10^8 sines and
    cosines", or, even better, the thing that you'd actually like to do.

    Andreas





--
Respectfully,
Massimo 'Max' J. Becker
Computer Scientist / Software Engineer
Commercial Pilot - SEL/MEL
(425)-239-1710

Thanks for all the advice and answers.

I will look more closely at the target computation that I should actually run in order to evaluate the performance gain.

Thanks again,
Pierre.
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
