Francis,

A point we have been trying to drive at but haven't stated outright is that the vast majority of bottlenecks in CUDA code come from inefficient memory access. High-performance CUDA code works very hard to limit certain kinds of memory access, but you seem to assume these operations will be cheap. This is why Eli and I keep suggesting you use the CPU to do the calculation.
It seems like you decided to work on this as a toy problem to learn CUDA. If that's the case, you will be better served by looking at some of the examples in the CUDA SDK instead. I'm sure people on this list could give even better suggestions if you asked.

David

On Aug 29, 2011 12:49 PM, "Francis" <[email protected]> wrote:
> The python list structure stores the length of the list already (it
> increments / decrements with appends / pops, etc.), so you'd be
> *re*computing a value that you already have.
>
> Yup, it does. I was thinking of using each thread to get the len() of each
> sub-list in parallel so I don't have to go through the entire list to get
> the length of each sub-list sequentially.
>
> I think that it would be best at this point for you to implement both
> and profile the two implementations to compare runtimes. My
> suggestion would be to implement the python-side wrangling first, and
> time that vs. my <10 line algo above (I suspect that just the
> wrangling will be slower than my solution, much less any call to
> CUDA), then add in the CUDA code after that if it still seems like
> it's going to be a performance win.
>
> Yes, more of empirical tests and then tweaking. Thanks again.
>
> Best regards,
>
> ./francis
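[Editor's note: a minimal sketch of the CPU-side approach discussed above, with names of my own choosing. The point in the quoted exchange is that CPython already stores each list's length in its header, so `len()` is O(1) per sub-list; gathering all the lengths is a single cheap pass on the CPU, almost certainly cheaper than marshalling nested lists for a host-to-device transfer.]

```python
def sublist_lengths(list_of_lists):
    """Return the length of each sub-list.

    Each len() call is O(1) in CPython (the length is stored with the
    list object), so this is one linear pass with no per-element work --
    nothing here to parallelize profitably on a GPU.
    """
    return [len(sub) for sub in list_of_lists]

data = [[1, 2, 3], [4], [], [5, 6]]
print(sublist_lengths(data))  # -> [3, 1, 0, 2]
```

To follow the profiling suggestion in the thread, this version could be timed with `timeit` against the CUDA variant to see which actually wins.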
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
