Andreas,

First, thanks very much for your reply. I tried the luxury dial (set it to 
zero) and got a factor of 3 speedup! So that's encouraging. My comparison is a 
similar approach with weave.inline: not threaded, all on the CPU, generating 
10**8 x,y pairs and computing pi in more like 2.8 seconds wall time.

<http://spvi.com/files/weave-monte-carlo>

<http://spvi.com/files/weave-mc-time>
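
Roughly, the weave version has this shape (a from-memory sketch, not the exact 
code behind the links; scipy.weave, so Python 2 era):

    # Monte Carlo pi on the CPU via scipy.weave.inline (sketch)
    from scipy import weave

    def mc_pi(n):
        code = """
        srand(12345);
        long count = 0;
        for (long i = 0; i < n; ++i) {
            double x = rand() / (double) RAND_MAX;
            double y = rand() / (double) RAND_MAX;
            if (x*x + y*y <= 1.0) ++count;
        }
        return_val = count;
        """
        count = weave.inline(code, ['n'], headers=['<stdlib.h>'])
        return 4.0 * count / n

    print(mc_pi(10**8))  # ~2.8 s wall time here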

I guess I was hoping for a more significant speedup going to a GPU approach. 
(Note that I'm not actually interested in the value of pi! I'm just trying to 
figure out how to get results out of a GPU. I'm building a small cluster with 
6 baby GPUs, and I'd like to get smart about making use of that resource.)

I'm also a little worried about the warning I'm getting that it "can't query 
SIMD group size". Looking at the source, it appears the platform is returning 
"Apple" as the vendor, and that case isn't handled in the code that checks, so 
it just returns None. When I run dump_properties I see that the max work group 
size is actually pretty big!
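
Here's the quick check I did (a sketch; I'm assuming the None comes from 
pyopencl.characterize.get_simd_group_size, and the device index may differ on 
other machines):

    import pyopencl as cl
    from pyopencl.characterize import get_simd_group_size

    dev = cl.get_platforms()[0].get_devices()[0]
    print(dev.platform.vendor)          # "Apple" here -- not handled by the vendor check
    print(dev.max_work_group_size)      # the "pretty big" max group size
    print(get_simd_group_size(dev, 4))  # -> None for unrecognized vendors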

<http://spvi.com/files/pyopencl-mc-lux0-time>
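
For reference, the shape of the luxury=0 run (a sketch from memory, not the 
exact code behind the link; I've shrunk n so both arrays fit in the GT 330M's 
memory):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.clrandom import RanluxGenerator
    from pyopencl.reduction import ReductionKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    n = 10**7  # 10**8 float32 pairs would need ~800 MB; batch for the real run
    rng = RanluxGenerator(queue, luxury=0)  # luxury=0: fastest random numbers

    x = cl_array.empty(queue, n, dtype=np.float32)
    y = cl_array.empty(queue, n, dtype=np.float32)
    rng.fill_uniform(x)
    rng.fill_uniform(y)

    # Map each (x, y) pair to 1 if it lands in the quarter circle, then sum.
    in_circle = ReductionKernel(ctx, np.int32, neutral="0",
            reduce_expr="a+b",
            map_expr="(x[i]*x[i] + y[i]*y[i] <= 1.0f) ? 1 : 0",
            arguments="__global const float *x, __global const float *y")

    count = in_circle(x, y).get()
    print("pi ~", 4.0 * count / n)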

Anyway, I'll try your idea of using enqueue_marker to track down what's really 
taking the time. (I guess 60% of it *was* spent generating excessively 
luxurious random numbers!) But I still feel I should be able to beat the CPU 
by quite a lot.
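
Something like this, I assume (a sketch; the comments stand in for the real 
RNG and reduction calls):

    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx,
            properties=cl.command_queue_properties.PROFILING_ENABLE)

    start = cl.enqueue_marker(queue)
    # ... enqueue the RNG fill here ...
    after_rng = cl.enqueue_marker(queue)
    # ... enqueue the reduction here ...
    after_reduce = cl.enqueue_marker(queue)
    queue.finish()

    ns = 1e-9  # profiling timestamps are in nanoseconds
    print("rng:   ", (after_rng.profile.end - start.profile.end) * ns, "s")
    print("reduce:", (after_reduce.profile.end - after_rng.profile.end) * ns, "s")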

thanks,
-steve

On Jan 19, 2012, at 1:27 PM, Andreas Kloeckner <[email protected]> wrote:

> On Thu, 19 Jan 2012 12:30:54 -0700, Steve Spicklemire <[email protected]> wrote:
>> OpenCL/CUDA newbie here, trying to use pyopencl/pycuda to learn my way 
>> around (I use Python a lot!). I have examples of what I've been trying to 
>> do to get familiar with the software. I'm trying to do an MC calculation of 
>> pi using the ReductionKernel. Here's what I've found:
>> 
>> <http://spvi.com/files/pyopencl-monte-carlo>
>> 
>> <http://spvi.com/files/pyopencl-mc-profile>
>> 
>> <http://spvi.com/files/pycuda-monte-carlo>
>> 
>> <http://spvi.com/files/pycuda-mc-profile>
>> 
>> I'm running on a macbook pro with GeForce GT 330M graphics.
>> 
>> I must be missing something basic. Both of these approaches are very
>> slow.
> 
> I.e. 10**8 samples in 15s, that's 6M samples/s. What's your reference
> value? Also note that clrandom has a 'luxury' value that can be turned
> down to get random numbers faster. Further, it might be good to know
> what part is slow. Python profiles are unfortunately unhelpful, as the
> GPU runs asynchronously and only blocks on the outbound data transfer
> (that's clearly visible in the CL profile, PyCUDA seems a bit more
> complicated).
> 
> Use cl.enqueue_marker with a profiling-enabled command queue to figure
> out what is actually taking the time, the reduction or the RNG.
> 
> HTH,
> Andreas
> 

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
