On Tue, 17 Jan 2012 15:05:00 +0100, Tomasz Rybak <[email protected]> wrote:
> On Mon, 2012-01-16 at 20:58 -0500, Andreas Kloeckner wrote:
> > Hi Tomasz,
> >
> > >
> > > I think I found it.
> > > Like in CUDA reduction bug (related to Fermi) it again seems
> > > to be related to too eager concurrency when reducing results.
> > > According to http://oscarbg.blogspot.com/2009/10/news-from-web.html
> > > "Actually the wavefront size is only 64 for the highend cards(48XX,
> > > 58XX, 57XX), but 32 for the middleend cards and 16 for the lowend
> > > cards."
> > > IMO we should use PREFERRED_WORK_GROUP_SIZE_MULTIPLE to get
> > > non_sync_size. At the same size we lose SIMD CPU optimisation,
> > > but I do not know for now how to fix those two at the same time.
> > > Attached patch fixes problem on Loveland, not breaking anything on
> > > NVIDIA ION.
> > >
> > > Investigating this I have found another problem with reasonable_work_*
> > > function. First, dev.warp_size_nv was raising LogicError (not
> > > AttributeError) so I have changed it to be the same as in
> > > get_simd_group_size. Second, there was problem with getting attributes
> > > from compiled but not build kernel. I had to add prg.build() and
> > > __kernel and __global - without those I was getting SEGFAULT
> > > from AMD OpenCL libraries.
> >
> > Thank you very much for investigating this, and for your fixes. I've
> > changed your fix slightly, in that get_simd_group() now *uses*
> > reasonable_work_group_size_multiple to find its best guess at the AMD
> > GPU wavefront size.
> >
> > I'd much appreciate if you could check the current code and report
> > back. We can then debate what to do about PyOpenCL 2012.1 (yes, it'll be
> > that).
>
> Code works OK on both Loveland and ION (all tests except image on CPU
> pass). I had to add pyopencl.characterize to setup.py (patch attached)
> for package to install characterize on Debian after your changes
> though.

Good catch, thanks. Applied.

Now there are two options: Release as-is, or add a bit more 'scan
magic'. By that I mean a) segmented scan and b) all those little
scan-based magic tricks that Thrust can do--copy_if, unique_by_key,
etc. Given that we have a working scan, those aren't hard to add. It
would take about a week, I guess.

I'll leave the choice up to you.

Andreas
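
For readers unfamiliar with the scan-based tricks mentioned above, here
is a minimal sketch of how a copy_if can be built on top of a working
scan. It is plain sequential Python, not PyOpenCL code and not the
implementation being proposed; the names exclusive_scan and
copy_if_via_scan are hypothetical and exist only for this illustration.
The idea is that an exclusive prefix sum over 0/1 predicate flags gives
each kept element its position in the output, so every element can then
be written independently.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum of a list: entry i is the sum of values[:i]."""
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def copy_if_via_scan(items, predicate):
    """Hypothetical helper: keep the items for which predicate is true.

    Order is preserved. This is a sequential stand-in for what would be
    a parallel scan followed by a parallel scatter on a device.
    """
    if not items:
        return []
    # Step 1: flag each element, 1 if it should be kept and 0 otherwise.
    flags = [1 if predicate(x) else 0 for x in items]
    # Step 2: the exclusive scan of the flags yields output positions.
    positions = exclusive_scan(flags)
    # The number of kept elements is the last position plus the last flag.
    count = positions[-1] + flags[-1]
    out = [None] * count
    # Step 3: scatter. Each iteration is independent of the others, which
    # is what would let one work-item handle one element on a device.
    for i, x in enumerate(items):
        if flags[i]:
            out[positions[i]] = x
    return out
```

For example, copy_if_via_scan([3, 8, 1, 6, 4], lambda x: x % 2 == 0)
keeps the even entries in their original order. A unique_by_key would
presumably follow the same pattern, with the flag marking positions
where the key differs from its predecessor.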
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
