On Thu, Feb 12, 2009 at 8:19 AM, Michael Abshoff
<michael.absh...@googlemail.com> wrote:
>
> Not even close. The current generation peaks at around 1.2 TFlops single
> precision, 280 GFlops double precision for ATI's hardware. The main
> problem with those numbers is that the memory on the graphics card
> cannot feed the data fast enough into the GPU to achieve theoretical
> peak. So those hundreds of GFlops are pure marketing :)
>

If your application is memory-bandwidth limited, then yes, you're not
likely to see 100s of GFlops anytime soon. However, compute-limited
applications can and do achieve 100s of GFlops on GPUs. Basic
operations like FFTs and (level 3) BLAS are compute limited, as are
the following applications:

http://www.ks.uiuc.edu/Research/gpu/
http://www.dam.brown.edu/scicomp/scg-media/report_files/BrownSC-2008-27.pdf
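To make the compute-vs-bandwidth distinction concrete, here's a
back-of-the-envelope sketch in Python. The byte counts are idealized
assumptions (caches and blocking ignored), purely illustrative:

# Arithmetic intensity = flops per byte of memory traffic.  High
# intensity -> compute limited; low intensity -> bandwidth limited.

def gemm_intensity(n, word=8):
    """Level-3 BLAS: C = A*B for n-by-n double precision matrices."""
    flops = 2.0 * n ** 3           # n^3 multiply-add pairs
    traffic = 3.0 * n ** 2 * word  # read A and B, write C (ideal)
    return flops / traffic         # ~n/12: grows with n

def axpy_intensity(n, word=8):
    """Level-1 BLAS: y = a*x + y."""
    flops = 2.0 * n                # one multiply, one add per element
    traffic = 3.0 * n * word       # read x and y, write y
    return flops / traffic         # 1/12, independent of n

for n in (100, 1000, 10000):
    print('n = %5d   gemm: %6.1f   axpy: %.3f   (flops/byte)'
          % (n, gemm_intensity(n), axpy_intensity(n)))

The exact constants depend on caching and blocking, but the point
stands: level-3 intensity scales with n, so the arithmetic units can
be kept fed, while level-1 intensity is a small constant.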
> So in reality you might get anywhere from 20% to 60% (if you are lucky)
> locally before accounting for transfers from main memory to GPU memory
> and so on. Given that recent Intel CPUs give you about 7 to 11 GFlops
> double precision per core and libraries like ATLAS give you that
> performance today without the need to jump through hoops, these numbers
> start to look a lot less impressive.

You neglect to mention that CPUs, which have roughly 1/10th the memory
bandwidth of high-end GPUs, are memory bound on the very same problems.
You will not see 7 to 11 GFlops on a memory-bound CPU code for the same
reason you argue that GPUs don't achieve 100s of GFlops on memory-bound
GPU codes. In severely memory-bound applications like sparse
matrix-vector multiplication (i.e. A*x for sparse A), the best you can
expect is ~10 GFlops on the GPU and ~1 GFlop on the CPU (in double
precision): SpMV performs only ~2 flops per nonzero while moving at
least 12 bytes (an 8-byte value plus a 4-byte column index), so both
processors are pinned to their memory bandwidth. We discuss this
problem in the following tech report:

http://forums.nvidia.com/index.php?showtopic=83825

It's true that host<->device transfers can be a bottleneck. In many
cases, the solution is to simply leave the data resident on the GPU.
For instance, you could imagine a variant of ndarray that held a
pointer to a device array. Of course, this requires that the other
expensive parts of your algorithm also execute on the GPU, so you're
not shuttling data over the PCIe bus all the time.

Full Disclosure: I'm a researcher at NVIDIA

--
Nathan Bell wnb...@gmail.com
http://graphics.cs.uiuc.edu/~wnbell/
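P.S. Here's a minimal sketch of the "leave the data resident on the
GPU" idea, using PyCUDA's GPUArray as a stand-in for the hypothetical
device-resident ndarray (assumes PyCUDA and a CUDA-capable card;
illustrative only, not a proposed design):

import numpy as np
import pycuda.autoinit               # sets up a CUDA context on import
import pycuda.gpuarray as gpuarray

x = np.random.rand(1024 * 1024).astype(np.float32)

x_gpu = gpuarray.to_gpu(x)           # one host -> device transfer
y_gpu = 2 * x_gpu + 1                # kernels run on device memory
z_gpu = y_gpu * y_gpu                # intermediates never touch the host

z = z_gpu.get()                      # one device -> host transfer

Each overloaded operator launches a kernel against device memory, so
the only PCIe traffic is the initial to_gpu() and the final get().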