Re: [PyOpenCL] Trouble understanding/applying ReductionKernel

Andreas Kloeckner Fri, 20 Jan 2012 09:55:38 -0800

Dear Steve,

On Thu, 19 Jan 2012 15:18:55 -0700, Steve Spicklemire <[email protected]> wrote:
> First, thanks much for your reply. I tried the luxury dial.. (set it
> to zero) and got a factor of 3 speedup! So that's encouraging. My
> comparison is a similar approach with weave.inline, not threaded, all
> CPU giving me 10**8 x,y pairs and computing pi in more like 2.8
> seconds wall time.
> 
> <http://spvi.com/files/weave-monte-carlo>
> 
> <http://spvi.com/files/weave-mc-time>
> 
> I guess I was hoping for a significant speedup going to a GPU
> approach. (note I'm naturally uninterested in the actual value of pi!
> I'm just trying to figure out how to get results out of a GPU. I'm
> building a small cluster with 6 baby GPUs and I'd like to get smart
> about making use of the resource)
> 
> I'm also a little worried about the warning I'm getting about "can't
> query SIMD group size". Looking at the source it appears the platform
> is returning "Apple" as a vendor, and that case is not treated in the
> code that checks.. so it just returns None. When I run
> 'dump_properties' I see that the max group size is pretty big!
> 
> <http://spvi.com/files/pyopencl-mc-lux0-time>
> 
> Anyway.. I'll try your idea of using enqueue_marker to try to track
> down what's really taking the time. (I guess 60% of it *was*
> generating excessively luxurious random numbers!) But I still feel I
> should be able to beat the CPU by quite a lot.


Set

export COMPUTE_PROFILE=1

and rerun your code. The driver will then write a profiler log file
that breaks down what's using time on the GPU. (This might not be true
on Apple CL if you're on a MacBook; I'm not sure whether that provides an
equivalent facility. If you find out, please report back to the list.)
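
If you'd rather not export the variable for your whole shell session, here
is a minimal sketch that sets it for a single child process from Python.
The helper name run_with_compute_profile is made up for illustration, and
the child command below is only a stand-in that echoes the variable; you
would substitute your own script there.

```python
import os
import subprocess
import sys


def run_with_compute_profile(argv):
    """Run argv as a child process with COMPUTE_PROFILE=1 in its environment."""
    env = dict(os.environ)        # copy, so the parent environment is untouched
    env["COMPUTE_PROFILE"] = "1"  # same setting as `export COMPUTE_PROFILE=1`
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# Stand-in child that only prints the variable. In practice argv would be
# something like [sys.executable, "your_script.py"] (hypothetical file name).
result = run_with_compute_profile(
    [sys.executable, "-c", "import os; print(os.environ['COMPUTE_PROFILE'])"]
)
print(result.stdout.strip())
```

This only covers setting the variable; whether a log actually gets written
still depends on the driver.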

Next, take into account that a GT330M lags by a factor of ~6-7 compared to a
'real' discrete GPU, firstly in memory bandwidth (GT330M: ~25 GB/s, good
discrete chip: ~180 GB/s), and, less critically, in processing
power. Also consider that your CPU can probably get to ~10 GB/s memory
bandwidth if used well.
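
To put rough numbers on that, here is a back-of-the-envelope sketch. It
reads the bandwidth figures above as roughly 25, 180 and 10 GB/s, and it
assumes (my assumption, not something measured) that each of the 10**8
x,y pairs is stored as two float32 values and read from memory exactly
once. If the random numbers are generated and consumed in registers, the
memory traffic is smaller and this bound matters less.

```python
GB = 1e9  # bytes per gigabyte (decimal)


def min_transfer_time(n_bytes, bandwidth_gb_per_s):
    """Lower bound in seconds for moving n_bytes at the given bandwidth."""
    return n_bytes / (bandwidth_gb_per_s * GB)


n_pairs = 10**8
bytes_per_pair = 2 * 4  # assumption: x and y are float32
n_bytes = n_pairs * bytes_per_pair

# Approximate peak bandwidths in GB/s, taken from the figures above.
bandwidths = {"GT330M": 25.0, "discrete GPU": 180.0, "CPU": 10.0}

times = {name: min_transfer_time(n_bytes, bw) for name, bw in bandwidths.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds * 1e3:.1f} ms")
```

All three bounds are far below a wall time of seconds, so under these
assumptions raw memory traffic alone would not explain the run time; the
profiler log should show where the rest goes.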

HTH,
Andreas

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
