As this seems to be the codepy/cgen thread, I thought I'd tack this on here.
I want to port Thrust code that is a little more involved than the sort example, namely the example code for summary statistics (http://code.google.com/p/thrust/source/browse/examples/summary_statistics.cu). I think I could port all of it using the appropriate cgen constructs (e.g. Struct, Template) with some tinkering. However, I wonder whether it is really necessary to port everything: is it possible to wrap only the parts I need access to and include the rest as one big string?

Alternatively, I suppose I could compute the summary stats quite easily with gpuarray, which seems like it would be easier. Is there likely to be a performance difference?

Thomas

On Thu, May 31, 2012 at 11:31 AM, Bryan Catanzaro <[email protected]> wrote:
> Yup, it can make a difference. =)
>
> The trick you mention for conjugate gradient works because the only
> thing control flow has to know is whether to launch another iteration
> - but it doesn't need to know what to do during that iteration. The
> actual work to be performed in each iteration of CG is independent of
> the state of the solver. This isn't the case for many other important
> optimization problems, where the next optimization step depends on the
> value of the result of the current step.
>
> - bryan
>
> On Thu, May 31, 2012 at 8:18 AM, Andreas Kloeckner
> <[email protected]> wrote:
> > Bryan Catanzaro <[email protected]> writes:
> >
> >> I agree that data size matters in these discussions. But I think the
> >> right way to account for it is to show performance at a range of data
> >> sizes, as measured from Python.
> >>
> >> The assumption that you'll keep the GPU busy isn't necessarily true.
> >> thrust::reduce, for example (which max_element uses internally),
> >> launches a big kernel, followed by a small kernel to finish the
> >> reduction tree, followed by a cudaMemcpy to transfer the result back
> >> to the host. The GPU won't be busy during the small kernel, nor
> >> during the cudaMemcpy, nor during the conversion back to Python, etc.
> >> Reduce is often used to make control flow decisions in optimization
> >> loops, where you don't know what the next optimization step to be
> >> performed is until the result is known, and so you can't launch the
> >> work speculatively. If the control flow is performed in Python, all
> >> these overheads are exposed to application performance - so I think
> >> they matter.
> >
> > Glad you brought that up. :) The conjugate gradient solver in PyCUDA
> > addresses exactly that by simply running iterations as fast as it can
> > and shepherding the residual results to the host on their own time,
> > deferring convergence decisions until the data is available. That was
> > good for a 20% or so gain last time I measured it (on a GT200).
> >
> > Andreas
> >
>
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
>
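For reference, the heart of summary_statistics.cu is a binary operator that merges two partial summaries using parallel-update formulas (Chan et al.), and that operator is the part a cgen port would have to reproduce. Here is a plain-Python sketch of the (count, mean, M2) case to show what the reduction actually computes - function names are mine, not from the example, and this deliberately stays host-side rather than pretending to be the CUDA version:

```python
# Sketch of the combining step behind Thrust's summary_statistics.cu:
# each partial result carries (n, mean, M2), and two partials are merged
# with the parallel update formulas of Chan et al. Because the merge is
# associative, it can serve as the operator of a tree reduction.

def summarize(xs):
    """Turn one chunk of data into a partial summary (n, mean, M2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    return (n, mean, m2)

def combine(a, b):
    """Merge two partial summaries into one, without revisiting the data."""
    (na, ma, m2a), (nb, mb, m2b) = a, b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return (n, mean, m2)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
left, right = summarize(data[:3]), summarize(data[3:])
n, mean, m2 = combine(left, right)
variance = m2 / n  # population variance; here mean = 5.0, variance = 4.0
```

The Thrust example extends the same idea to min, max, skewness, and kurtosis, but the merge step has this shape throughout.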
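The deferred-convergence trick Andreas describes can also be sketched in isolation. The code below is a toy stand-in, not PyCUDA: the "solver" is a simple geometric residual decay, and the queue models residuals that arrive at the host with a lag instead of forcing a synchronization every iteration. The price is a few extra iterations past the true convergence point; the payoff is that the GPU never idles waiting for the host's decision:

```python
# Host-side sketch of deferred convergence checking: keep launching
# iterations eagerly, and test convergence only against residuals that
# have already "arrived" (lag iterations old). Purely illustrative -
# the kernel launch and async transfer are replaced by arithmetic.

from collections import deque

def solve(residual0=1.0, decay=0.5, tol=1e-3, lag=2):
    pending = deque()      # residuals still "in flight" to the host
    r = residual0
    iterations = 0
    while True:
        r *= decay         # stand-in for launching one solver iteration
        iterations += 1
        pending.append(r)  # stand-in for an async device-to-host copy
        # Only consult a residual once it is `lag` iterations old.
        if len(pending) > lag and pending.popleft() <= tol:
            break
    return iterations, r

iters, final_r = solve()
# A fully synchronous check would stop after 10 iterations; the deferred
# version runs `lag` extra iterations (12 here) but never stalls the device.
```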
