On Tue, 17 Jan 2012 15:05:00 +0100, Tomasz Rybak <[email protected]> wrote:
> On Mon, 2012-01-16 at 20:58 -0500, Andreas Kloeckner wrote:
> > Hi Tomasz,
> > 
> > > 
> > > I think I found it.
> > > As with the CUDA reduction bug (related to Fermi), it again seems
> > > to be related to overly eager concurrency when reducing results.
> > > According to http://oscarbg.blogspot.com/2009/10/news-from-web.html
> > > "Actually the wavefront size is only 64 for the highend cards(48XX,
> > > 58XX, 57XX), but 32 for the middleend cards and 16 for the lowend
> > > cards."
> > > IMO we should use PREFERRED_WORK_GROUP_SIZE_MULTIPLE to get
> > > non_sync_size. In doing so we lose the SIMD CPU optimisation,
> > > but I do not know for now how to fix those two at the same time.
> > > The attached patch fixes the problem on Loveland without breaking
> > > anything on NVIDIA ION.
> > > 
> > > While investigating this I found further problems with the
> > > reasonable_work_* functions. First, dev.warp_size_nv was raising
> > > LogicError (not AttributeError), so I have changed it to be the same
> > > as in get_simd_group_size. Second, there was a problem with getting
> > > attributes from a kernel that was compiled but not built. I had to add
> > > prg.build() and __kernel and __global - without those I was getting
> > > a SEGFAULT from the AMD OpenCL libraries.
> > 
> > Thank you very much for investigating this, and for your fixes. I've
> > changed your fix slightly, in that get_simd_group_size() now *uses*
> > reasonable_work_group_size_multiple to find its best guess at the AMD
> > GPU wavefront size.
> > 
> > I'd much appreciate if you could check the current code and report
> > back. We can then debate what to do about PyOpenCL 2012.1 (yes, it'll be
> > that).
> 
> The code works OK on both Loveland and ION (all tests except image on
> CPU pass). I had to add pyopencl.characterize to setup.py (patch
> attached) for the package to install characterize on Debian after your
> changes, though.

Good catch, thanks. Applied. Now there are two options: Release as-is,
or add a bit more 'scan magic'. By that I mean a) segmented scan and b)
all those little scan-based magic tricks that Thrust can do--copy_if,
unique_by_key, etc. Given that we have a working scan, those aren't hard
to add. It would take about a week, I guess. I'll leave the choice up to
you.
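
To make the "scan magic" concrete, here is a rough sketch in plain Python
(not PyOpenCL) of how copy_if and a segmented scan can be built on top of
a prefix sum. The function names and signatures below are illustrative
assumptions only; they are not an existing PyOpenCL or Thrust API, and a
real device version would run each loop as a parallel kernel.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    values = list(values)
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def copy_if(items, predicate):
    """Stream compaction: keep items where predicate holds, in order.

    Each kept item's output slot comes from an exclusive scan over 0/1
    flags, so on a device every item could be written independently.
    """
    items = list(items)
    if not items:
        return []
    flags = [1 if predicate(x) else 0 for x in items]
    offsets = exclusive_scan(flags)
    count = offsets[-1] + flags[-1]  # total number of kept items
    out = [None] * count
    for x, flag, offset in zip(items, flags, offsets):
        if flag:
            out[offset] = x
    return out


def segmented_inclusive_scan(values, head_flags):
    """Inclusive prefix sum that restarts wherever head_flags is truthy."""
    out = []
    running = 0
    for v, head in zip(values, head_flags):
        running = v if head else running + v
        out.append(running)
    return out
```

For instance, copy_if([3, 8, 1, 6, 4], lambda x: x % 2 == 0) keeps the
even entries in their original order, and the segmented scan treats each
flagged position as the start of a new, independent prefix sum.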

Andreas

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
