As this seems to be the codepy/cgen thread I thought I'd tack this on here.

I want to port thrust code that is a little bit more involved than the sort
example. Namely the example code for summary statistics (
http://code.google.com/p/thrust/source/browse/examples/summary_statistics.cu
)

I think I could port all of this using the appropriate cgen constructs
(e.g. Struct, Template) with some tinkering. However, I wonder whether it
is really necessary to port everything. Is it possible to wrap only the
parts I need access to and include the rest as one big string?

I suppose that, alternatively, I could compute the summary stats quite
easily with gpuarray. Is there likely to be a performance difference? That
seems like it would be the easier route.
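For reference, the heart of that thrust example is a binary combine step that merges two partial summaries (count, min, max, mean, M2) so that thrust::reduce can fold them in any order. Here is a CPU-side Python sketch of that step; the function and field names are my own, not taken from the example, and this is just to illustrate the associative merge, not the GPU implementation:

```python
def combine(a, b):
    # Merge two partial summaries, each a dict with keys
    # n (count), min, max, mean, and M2 (sum of squared deviations).
    # This mirrors the binary_op in summary_statistics.cu.
    n = a["n"] + b["n"]
    delta = b["mean"] - a["mean"]
    return {
        "n": n,
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "mean": a["mean"] + delta * b["n"] / n,
        "M2": a["M2"] + b["M2"] + delta * delta * a["n"] * b["n"] / n,
    }

def summarize(xs):
    # unary_op: lift a single value into a summary, then fold with
    # combine. Because combine is associative, a parallel reduction
    # (as thrust::reduce performs) gives the same result as this
    # sequential fold.
    lift = lambda x: {"n": 1, "min": x, "max": x, "mean": float(x), "M2": 0.0}
    acc = lift(xs[0])
    for x in xs[1:]:
        acc = combine(acc, lift(x))
    return acc
```

The population variance is then `M2 / n`. Wrapping only this struct and combine functor with cgen, and including the rest of the .cu file as a literal string, would keep the ported surface small.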

Thomas

On Thu, May 31, 2012 at 11:31 AM, Bryan Catanzaro <[email protected]> wrote:

> Yup, it can make a difference. =)
>
> The trick you mention for conjugate gradient works because the only
> thing control flow has to know is whether to launch another iteration
> - but it doesn't need to know what to do during that iteration. The
> actual work to be performed in each iteration of CG is independent of
> the state of the solver.  This isn't the case for many other important
> optimization problems, where the next optimization step depends on the
> value of the result of the current step.
>
> - bryan
>
> On Thu, May 31, 2012 at 8:18 AM, Andreas Kloeckner
> <[email protected]> wrote:
> > Bryan Catanzaro <[email protected]> writes:
> >
> >> I agree that data size matters in these discussions.  But I think the
> >> right way to account for it is show performance at a range of data
> >> sizes, as measured from Python.
> >>
> >> The assumption that you'll keep the GPU busy isn't necessarily true.
> >> thrust::reduce, for example (which max_element uses internally),
> >> launches a big kernel, followed by a small kernel to finish the
> >> reduction tree, followed by a cudaMemcpy to transfer the result back
> >> to the host.  The GPU won't be busy during the small kernel, nor
> >> during the cudaMemcpy, nor during the conversion back to Python, etc.
> >> Reduce is often used to make control flow decisions in optimization
> >> loops, where you don't know what the next optimization step to be
> >> performed is until the result is known, and so you can't launch the
> >> work speculatively.  If the control flow is performed in Python, all
> >> these overheads are exposed to application performance - so I think
> >> they matter.
> >
> > Glad you brought that up. :) The conjugate gradient solver in PyCUDA
> > addresses exactly that by simply running iterations as fast as it can
> > and shepherding the residual results to the host on their own time,
> > deferring convergence decisions until the data is available. That was
> > good for a 20% or so gain last time I measured it (on a GT200).
> >
> > Andreas
> >
>
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
>