Well... *that* worked. ;-) I have no clear idea what it all means... I'll have to track down some docs.
<http://www.spvi.com/files/cuda_profile_1.log>

thanks for the tip!

-steve

On Jan 20, 2012, at 10:55 AM, Andreas Kloeckner wrote:

> Dear Steve,
>
> On Thu, 19 Jan 2012 15:18:55 -0700, Steve Spicklemire <[email protected]> wrote:
>> First, thanks much for your reply. I tried the luxury dial (set it
>> to zero) and got a factor of 3 speedup! So that's encouraging. My
>> comparison is a similar approach with weave.inline, not threaded, all
>> CPU, giving me 10**8 x,y pairs and computing pi in more like 2.8
>> seconds wall time.
>>
>> <http://spvi.com/files/weave-monte-carlo>
>>
>> <http://spvi.com/files/weave-mc-time>
>>
>> I guess I was hoping for a significant speedup going to a GPU
>> approach. (Note I'm naturally uninterested in the actual value of pi!
>> I'm just trying to figure out how to get results out of a GPU. I'm
>> building a small cluster with 6 baby GPUs and I'd like to get smart
>> about making use of the resource.)
>>
>> I'm also a little worried about the warning I'm getting about "can't
>> query SIMD group size". Looking at the source, it appears the platform
>> is returning "Apple" as a vendor, and that case is not handled in the
>> code that checks, so it just returns None. When I run
>> 'dump_properties' I see that the max group size is pretty big!
>>
>> <http://spvi.com/files/pyopencl-mc-lux0-time>
>>
>> Anyway, I'll try your idea of using enqueue_marker to track down
>> what's really taking the time. (I guess 60% of it *was* generating
>> excessively luxurious random numbers!) But I still feel I should be
>> able to beat the CPU by quite a lot.
>
> Set
>
>     export COMPUTE_PROFILE=1
>
> and rerun your code. The driver will have written a profiler log file
> that breaks down what's using time on the GPU. (This might not be true
> on Apple CL if you're on a MacBook; I'm not sure if that provides an
> equivalent facility. If you find out, please report back to the list.)
>
> Next, take into account that a GT330M lags by a factor of ~6-7 compared
> to a 'real' discrete GPU, firstly in memory bandwidth (GT330M: ~25 GB/s,
> good discrete chip: ~180 GB/s), and, less critically, in processing
> power. Also consider that your CPU can probably get to ~10 GB/s memory
> bandwidth if used well.
>
> HTH,
> Andreas
>
> _______________________________________________
> PyOpenCL mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pyopencl
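
For readers finding this thread later: below is a minimal sketch of the kind of setup under discussion, not Steve's actual code (which is behind the links above). It shows the "luxury dial" on pyopencl.clrandom's RANLUX generator set to 0 and keeps the hit-counting on the device with a ReductionKernel. The array size, names, and data type here are illustrative assumptions.

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
import pyopencl.clrandom as clrandom
from pyopencl.reduction import ReductionKernel

ctx = cl.create_some_context()
# Enable profiling so events carry timestamps (used in the timing sketch below).
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

n = 10**7  # smaller than the 10**8 pairs in the thread, to fit a laptop GPU
x = cl_array.empty(queue, n, dtype=np.float32)
y = cl_array.empty(queue, n, dtype=np.float32)

# luxury=0 is the "luxury dial" turned all the way down: the cheapest
# (least thoroughly decorrelated) RANLUX output, but much faster.
gen = clrandom.RanluxGenerator(queue, luxury=0)
gen.fill_uniform(x)
gen.fill_uniform(y)

# Count points inside the quarter circle entirely on the device.
count_inside = ReductionKernel(
    ctx, np.int32, neutral="0", reduce_expr="a+b",
    map_expr="(x[i]*x[i] + y[i]*y[i] <= 1.0f) ? 1 : 0",
    arguments="__global const float *x, __global const float *y")

hits = count_inside(x, y).get()
print("pi ~=", 4.0 * hits / n)
```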

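A sketch of the enqueue_marker timing idea mentioned above: with a profiling-enabled queue, marker events pick up timestamps once all previously enqueued work has completed, so the difference between two markers approximates the device time spent on the commands enqueued between them. This assumes the `queue`, `gen`, `x`, and `y` objects from the previous sketch.

```python
import pyopencl as cl

start_evt = cl.enqueue_marker(queue)   # completes once prior work is done
gen.fill_uniform(x)
gen.fill_uniform(y)
end_evt = cl.enqueue_marker(queue)     # completes once the fills are done
end_evt.wait()

# With PROFILING_ENABLE set on the queue, events report nanosecond timestamps.
elapsed_ms = (end_evt.profile.end - start_evt.profile.end) * 1e-6
print("RNG fill time: %.3f ms" % elapsed_ms)
```

On NVIDIA drivers the COMPUTE_PROFILE=1 log gives a per-kernel breakdown of the same thing; the marker approach is just a portable cross-check from inside the script.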