Two more quick points...

If I let the code keep running on the ION2 system, I get this:

<http://www.spvi.com/files/bccd-out-9.txt>

And... if I set the environment variable to show compiler output on the ION2 
system, I see this:

<http://www.spvi.com/files/bccd-compiler-output-9.txt>

I'm struggling to interpret what all that means. ;-)

Any hints appreciated.

BTW... is there a 'release' method I need to call to free device memory when 
using pyopencl?

Do I need to create my context/queue only once and pass them around for reuse 
all the time?
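
For concreteness, here's the pattern I'm imagining for both questions (just a 
sketch, not my actual code; one_trial is a made-up name):

    import pyopencl as cl
    import pyopencl.array as cl_array
    import numpy

    # Create the context and queue once, at startup...
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    def one_trial(queue, n):
        # ...and pass the queue into every function that needs it.
        a = cl_array.zeros(queue, n, numpy.float32)
        # ... kernel launches that fill 'a' would go here ...
        result = float(cl_array.sum(a).get())
        # Device buffers are freed when the Python objects are
        # garbage-collected; 'del' (or a.data.release()) frees the
        # GPU memory without waiting for the collector.
        del a
        return result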

thanks,
-steve

On Jan 27, 2012, at 6:24 AM, Steve Spicklemire <[email protected]> wrote:

> Hi Folks,
> 
> More on this saga. ;-)
> 
> Short story: I *think* I'm having memory-management trouble, but I'm not 
> sure how, or how to track it down.
> 
> I've changed my code a fair amount after getting a bit more educated WRT GPU 
> programming.
> 
> I've got two systems I'm testing on: my laptop (15" MacBook Pro, NVIDIA 
> GeForce GT 330M, 512 MB) and a baby cluster I've built using BCCD (six 
> Debian Intel Atom ITX boards with built-in ION2 graphics).
> 
> The laptop is more portable. ;-)
> 
> I decided to try using ranluxcl directly inside a custom kernel rather than 
> the cl.rand module (though I read that module's source and tried to use it 
> as an example of its use).
> 
> I'm still using the ReductionKernel class to get the final result.
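> 
> In case it helps, the structure is roughly this (a simplified sketch, not 
> my real kernel; the ranluxcl state handling is elided and inside() is a 
> stand-in hit test on precomputed points):
> 
>     from pyopencl.reduction import ReductionKernel
>     import numpy
> 
>     count_inside = ReductionKernel(ctx, numpy.float32,
>         neutral="0",
>         reduce_expr="a+b",
>         map_expr="inside(x[i], y[i])",
>         arguments="__global const float *x, __global const float *y",
>         preamble="""
>         float inside(float x, float y)  /* 1 if (x,y) is in the unit circle */
>         { return (x*x + y*y) <= 1.0f ? 1.0f : 0.0f; }
>         """)
> 
>     # usage: hits = count_inside(x_dev, y_dev).get()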
> 
> Here's the code I'm running on the mac:
> 
> <http://www.spvi.com/files/compute_pi_9.py>
> 
> And here are the results....
> 
> <http://www.spvi.com/files/mac-out-9.txt>
> 
> It runs to completion... but notice that the 'random' numbers aren't behaving 
> randomly! I thought the period of ranlux was very large, so I'm puzzled.
> 
> Next... when I run this code:
> 
> <http://www.spvi.com/files/bccd-compute_pi_9.py>
> 
> on one of the cluster nodes, I get this:
> 
> <http://www.spvi.com/files/bccd-out-9.txt>
> 
> Wacky! Same code (more or less... just the startup is different).
> 
> If I let it keep running, it will eventually say "Host memory exhausted" or 
> some such. By "host" I'm assuming it means the CPU, not the GPU, right? Very 
> little host memory is involved, I think... it's almost entirely on the GPU. 
> But anyway, doesn't memory get freed when the function exits and the local 
> Python variables go out of scope? Mysterious!
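> 
> One experiment I suppose I could try: force cleanup every iteration and see 
> whether the exhaustion goes away (run_one_block is a made-up name standing 
> in for my per-block Monte Carlo function):
> 
>     import gc
> 
>     for trial in range(n_trials):
>         estimate = run_one_block(queue, n)  # device arrays stay local to the call
>         queue.finish()   # let pending work (and any deferred frees) complete
>         gc.collect()     # break reference cycles promptly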
> 
> I'm pretty sure I'm still missing some basic rule/concept about pyopencl... 
> any feedback appreciated!
> 
> thanks,
> -steve
> 
> On Jan 20, 2012, at 10:55 AM, Andreas Kloeckner wrote:
> 
>>> 
>>> I guess I was hoping for a significant speedup going to a GPU
>>> approach. (note I'm naturally uninterested in the actual value of pi!
>>> I'm just trying to figure out how to get results out of a GPU. I'm
>>> building a small cluster with 6 baby GPUs and I'd like to get smart
>>> about making use of the resource)
>>> 
>>> I'm also a little worried about the warning I'm getting about "can't
>>> query SIMD group size". Looking at the source, it appears the platform
>>> is returning "Apple" as the vendor, and that case isn't handled in the
>>> code that checks... so it just returns None. When I run
>>> 'dump_properties' I see that the max group size is pretty big!
>>> 
>>> <http://spvi.com/files/pyopencl-mc-lux0-time>
>>> 
>>> Anyway... I'll try your idea of using enqueue_marker to track
>>> down what's really taking the time. (I guess 60% of it *was*
>>> generating excessively luxurious random numbers!) But I still feel I
>>> should be able to beat the CPU by quite a lot.
>> 
>> Set
>> 
>> export COMPUTE_PROFILE=1
>> 
>> and rerun your code. The driver will have written a profiler log file
>> that breaks down what's using time on the GPU. (This might not be true
>> on Apple CL if you're on a MacBook, not sure if that provides an
>> equivalent facility. If you find out, please report back to the list.)
>> 
>> Next, take into account that a GT 330M lags a 'real' discrete GPU by a
>> factor of ~6-7, first in memory bandwidth (GT 330M: ~25 GB/s, good
>> discrete chip: ~180 GB/s), and, less critically, in processing
>> power. Also consider that your CPU can probably reach ~10 GB/s memory
>> bandwidth if used well.
>> 
>> HTH,
>> Andreas
>> 
> 
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
