Andreas,

First, thanks very much for your reply. I tried the luxury dial (set it to zero) and got a factor-of-3 speedup! So that's encouraging.

My reference point is a similar approach with weave.inline (single-threaded, all CPU) that generates 10**8 x,y pairs and computes pi in more like 2.8 seconds of wall time; the files are linked below.
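For flavor, the CPU version is essentially this, a from-memory sketch, not the actual linked file (the RNG and timing details differ):

    import time
    from scipy import weave

    def mc_pi(n):
        # C loop: count hits inside the quarter disc, estimate pi = 4*hits/n
        code = """
        srand(42);
        long hits = 0;
        for (long i = 0; i < n; i++) {
            double x = rand()/(double)RAND_MAX;
            double y = rand()/(double)RAND_MAX;
            if (x*x + y*y <= 1.0) hits++;
        }
        return_val = 4.0*hits/n;
        """
        return weave.inline(code, ['n'], headers=['<cstdlib>'])

    t0 = time.time()
    print mc_pi(10**8), "in", time.time() - t0, "s"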
<http://spvi.com/files/weave-monte-carlo>
<http://spvi.com/files/weave-mc-time>

I guess I was hoping for a significant speedup from going to a GPU approach. (Note that I'm not actually interested in the value of pi! I'm just trying to figure out how to get results out of a GPU. I'm building a small cluster with six baby GPUs, and I'd like to get smart about making use of the resource.)

I'm also a little worried about the warning I'm getting: "can't query SIMD group size". Looking at the source, it appears the platform is returning "Apple" as the vendor, and that case isn't handled in the code that checks, so it just returns None. Yet when I run dump_properties, I see that the max group size is pretty big!

<http://spvi.com/files/pyopencl-mc-lux0-time>

Anyway, I'll try your idea of using enqueue_marker to track down what's really taking the time. (I guess 60% of it *was* spent generating excessively luxurious random numbers!) But I still feel I should be able to beat the CPU by quite a lot.

thanks,
-steve

On Jan 19, 2012, at 1:27 PM, Andreas Kloeckner <[email protected]> wrote:

> On Thu, 19 Jan 2012 12:30:54 -0700, Steve Spicklemire <[email protected]> wrote:
>> OpenCL/CUDA newbie here, trying to use pyopencl/pycuda to learn my way
>> around (I use Python a lot!). I have examples of what I've been trying to
>> do to get familiar with the software. I'm trying to do an MC calculation
>> of pi using the ReductionKernel. Here's what I've found:
>>
>> <http://spvi.com/files/pyopencl-monte-carlo>
>>
>> <http://spvi.com/files/pyopencl-mc-profile>
>>
>> <http://spvi.com/files/pycuda-monte-carlo>
>>
>> <http://spvi.com/files/pycuda-mc-profile>
>>
>> I'm running on a MacBook Pro with GeForce GT 330M graphics.
>>
>> I must be missing something basic. Both of these approaches are very
>> slow.
>
> I.e. 10**8 samples in 15 s, that's about 6M samples/s. What's your
> reference value? Also note that clrandom has a 'luxury' value that can
> be turned down to get random numbers faster. Further, it might be good
> to know what part is slow. Python profiles are unfortunately unhelpful,
> as the GPU runs asynchronously and only blocks on the outbound data
> transfer (that's clearly visible in the CL profile; PyCUDA seems a bit
> more complicated).
>
> Use cl.enqueue_marker with a profiling-enabled command queue to figure
> out what is actually taking the time, the reduction or the RNG.
>
> HTH,
> Andreas
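P.S. For the archives, here's the shape of the measurement I have in mind, as a rough, untested sketch; the RanluxGenerator arguments and the ReductionKernel expressions are my guesses at the API, not code from the files above:

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    import pyopencl.clrandom as clrandom
    from pyopencl.reduction import ReductionKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx,
            properties=cl.command_queue_properties.PROFILING_ENABLE)

    n = 10**7
    gen = clrandom.RanluxGenerator(queue, luxury=0)  # luxury dial at zero
    x = cl_array.empty(queue, n, dtype=np.float32)
    y = cl_array.empty(queue, n, dtype=np.float32)

    # map each (x, y) to 1 if it lands inside the quarter disc, then sum
    in_circle = ReductionKernel(ctx, np.int32, neutral="0",
            reduce_expr="a+b",
            map_expr="(x[i]*x[i] + y[i]*y[i] <= 1.0f) ? 1 : 0",
            arguments="__global const float *x, __global const float *y")

    m0 = cl.enqueue_marker(queue)
    gen.fill_uniform(x)
    gen.fill_uniform(y)
    m1 = cl.enqueue_marker(queue)  # everything before this is RNG
    hits = in_circle(x, y)
    m2 = cl.enqueue_marker(queue)  # m1 to here is the reduction
    queue.finish()

    # event timestamps are in nanoseconds
    print "rng:    %.4f s" % ((m1.profile.end - m0.profile.end)*1e-9)
    print "reduce: %.4f s" % ((m2.profile.end - m1.profile.end)*1e-9)
    print "pi ~ %f" % (4.0*float(hits.get())/n)

If the two differences roughly add up to the wall time, that should settle whether the RNG or the reduction is the bottleneck.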
