Well.. *that* worked. ;-)

I have no clear idea what it all means... I'll have to track down some docs.

<http://www.spvi.com/files/cuda_profile_1.log>
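
In case anyone else wants to poke at one of these logs: below is a minimal sketch of summing up the per-kernel GPU time, assuming the log uses the usual key=[ value ] lines the NVIDIA command-line profiler writes (method=[...], gputime=[...] in microseconds) -- I haven't checked this against every driver version.

    import re
    from collections import defaultdict

    pat = re.compile(r"method=\[ (?P<method>\S+) \] gputime=\[ (?P<us>[\d.]+) \]")

    totals = defaultdict(float)
    with open("cuda_profile_1.log") as f:
        for line in f:
            m = pat.search(line)
            if m:
                totals[m.group("method")] += float(m.group("us"))

    # biggest GPU-time consumers first
    for method, us in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%-40s %10.1f us" % (method, us))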

thanks for the tip!
-steve


On Jan 20, 2012, at 10:55 AM, Andreas Kloeckner wrote:

> Dear Steve,
> 
> On Thu, 19 Jan 2012 15:18:55 -0700, Steve Spicklemire <[email protected]> wrote:
>> First, thanks much for your reply. I tried the luxury dial.. (set it
>> to zero) and got a factor of 3 speedup! So that's encouraging. My
>> comparison is a similar approach with weave.inline, not threaded, all
>> CPU giving me 10**8 x,y pairs and computing pi in more like 2.8
>> seconds wall time.
>> 
>> <http://spvi.com/files/weave-monte-carlo>
>> 
>> <http://spvi.com/files/weave-mc-time>
>> 
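
(A minimal sketch of what the PyOpenCL side of such a run can look like, assuming pyopencl's RanluxGenerator with the luxury knob at 0 and a ReductionKernel doing the counting -- an illustration, not the exact script timed above:)

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.clrandom import RanluxGenerator
    from pyopencl.reduction import ReductionKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    n = 10**7  # the timings above used 10**8 pairs
    gen = RanluxGenerator(queue, luxury=0)  # cheapest luxury setting

    x = cl_array.empty(queue, n, dtype=np.float32)
    y = cl_array.empty(queue, n, dtype=np.float32)
    gen.fill_uniform(x)
    gen.fill_uniform(y)

    # count the points that land inside the unit quarter circle
    count_inside = ReductionKernel(ctx, np.int32, neutral="0",
            reduce_expr="a+b",
            map_expr="(x[i]*x[i] + y[i]*y[i] <= 1.0f) ? 1 : 0",
            arguments="__global const float *x, __global const float *y")

    n_inside = int(count_inside(x, y).get())
    print("pi ~= %.6f" % (4.0 * n_inside / n))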
>> I guess I was hoping for a significant speedup going to a GPU
>> approach. (note I'm naturally uninterested in the actual value of pi!
>> I'm just trying to figure out how to get results out of a GPU. I'm
>> building a small cluster with 6 baby GPUs and I'd like to get smart
>> about making use of the resource)
>> 
>> I'm also a little worried about the warning I'm getting about "can't
>> query SIMD group size". Looking at the source it appears the platform
>> is returning "Apple" as a vendor, and that case is not treated in the
>> code that checks.. so it just returns None. When I run
>> 'dump_properties' I see that the max group size is pretty big!
>> 
>> <http://spvi.com/files/pyopencl-mc-lux0-time>
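
(The group-size numbers can also be queried straight off the device; the SIMD-width helper is, I believe, the one in pyopencl.characterize, which falls back to None for vendor strings it doesn't recognize -- a quick sketch:)

    import numpy as np
    import pyopencl as cl
    from pyopencl.characterize import get_simd_group_size

    ctx = cl.create_some_context()
    dev = ctx.devices[0]

    print("platform / device vendor:", dev.platform.vendor, "/", dev.vendor)
    print("max work group size:     ", dev.max_work_group_size)
    # returns None when the vendor isn't one the heuristic knows about
    print("SIMD group size:         ",
          get_simd_group_size(dev, np.dtype(np.float32).itemsize))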
>> 
>> Anyway.. I'll try your idea of using enqueue_marker to try to track
>> down what's really taking the time. (I guess 60% of it *was*
>> generating excessively luxurious random numbers!) But I still feel I
>> should be able to beat the CPU by quite a lot.
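
(A minimal sketch of the marker idea mentioned above, assuming a queue created with profiling enabled; the timestamps come back in nanoseconds:)

    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx,
            properties=cl.command_queue_properties.PROFILING_ENABLE)

    before = cl.enqueue_marker(queue)
    # ... enqueue the RNG fill, the counting kernel, copies, etc. here ...
    after = cl.enqueue_marker(queue)
    after.wait()

    print("bracketed GPU work: %.3f ms"
          % ((after.profile.end - before.profile.end) * 1e-6))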
> 
> Set
> 
> export COMPUTE_PROFILE=1
> 
> and rerun your code. The driver will have written a profiler log file
> that breaks down what's using time on the GPU. (This might not be true
> on Apple CL if you're on a MacBook, not sure if that provides an
> equivalent facility. If you find out, please report back to the list.)
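
(Side note: the same command-line profiler understands a couple of companion variables -- names from memory, so treat them as hints rather than gospel:)

    export COMPUTE_PROFILE=1
    export COMPUTE_PROFILE_CSV=1                # comma-separated output
    export COMPUTE_PROFILE_LOG=profile_%d.log   # %d becomes the device number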
> 
> Next, take into account that a GT330M lags by a factor of ~6-7 compared to a
> 'real' discrete GPU, first in memory bandwidth (GT330M: ~25 GB/s, good
> discrete chip: ~180 GB/s), and, less critically, in processing
> power. Also consider that your CPU can probably get to ~10 GB/s memory
> bandwidth if used well.
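
(To put rough numbers on that for this particular problem -- assuming 10**8 float32 (x, y) pairs, each written once by the RNG and read once by the counting kernel:)

    n = 1e8
    bytes_moved = n * 2 * 4 * 2   # pairs * 2 floats * 4 bytes * (1 write + 1 read)
    for name, gb_s in [("GT 330M", 25.0), ("good discrete GPU", 180.0), ("CPU", 10.0)]:
        print("%-18s >= %6.1f ms" % (name, bytes_moved / (gb_s * 1e9) * 1e3))

So the raw sample traffic alone is in the tens-to-hundreds of milliseconds on any of these; the bulk of a multi-second wall time has to be going elsewhere (RNG work, launch overhead, host<->device copies).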
> 
> HTH,
> Andreas
> 

