[PyCUDA] Fwd: Re: broadcasting and strided data

Andreas Klöckner Thu, 27 Aug 2009 07:06:57 -0700

Forgot to cc the list. Darn. :)

----------  Forwarded Message  ----------

Subject: Re: [PyCUDA] broadcasting and strided data
Date: Thursday, 27 August 2009
From: Andreas Klöckner <[email protected]>
To: James Bergstra <[email protected]>

On Tuesday, 25 August 2009, you wrote:
> It probably requires the expertise of a few people to get the design
> right, so I'm reluctant even to try to put a patch together.  First,
> it requires some changes to the data container.  Some of the issues
> that come up are:
> - what should be the strides for broadcastable dimensions (I like 0,
> but numpy does it differently)

Assigning a stride of zero seems to be a good "simple" way, even though it 
might waste some processor power on unneeded index math. How does numpy do 
it?
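
A minimal sketch of the stride-zero idea in plain Python, not PyCUDA code. The 
helper names flat_offset and broadcast_strides are hypothetical, and strides 
are assumed to be counted in elements rather than bytes:

```python
def flat_offset(index, strides):
    """Return the flat element offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


def broadcast_strides(shape, strides):
    """Give every length-1 axis a stride of 0 so it can be broadcast."""
    return tuple(0 if n == 1 else s for n, s in zip(shape, strides))


# A row of 3 elements stored with shape (1, 3) and element strides (3, 1).
row = [10.0, 20.0, 30.0]
strides = broadcast_strides((1, 3), (3, 1))

# Read it as if it had shape (4, 3). The zero stride makes the row index
# drop out of the offset, so all four rows map to the same storage.
expanded = [[row[flat_offset((i, j), strides)] for j in range(3)]
            for i in range(4)]
```

The multiplication by zero in flat_offset is the "unneeded index math": it is 
still carried out for every element even though it never changes the result.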

> - should strides be in data-type units or byte units

I find this somewhat irrelevant. For the kernels themselves, data-type units 
are likely more useful, especially if texturing is used. For storage, looking 
like numpy by using byte offsets might be the way to go. Since doing the 
conversion on the host right before the kernel invocation is easy and cheap, 
I don't see why we can't have our cake and eat it, too. (See also the next 
question.)
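
The host-side conversion could look roughly like this. It is a sketch under 
the assumption that every byte stride is a whole multiple of the item size; 
element_strides is a hypothetical helper, not an existing PyCUDA function:

```python
def element_strides(byte_strides, itemsize):
    """Convert numpy-style byte strides to strides counted in elements."""
    for s in byte_strides:
        if s % itemsize != 0:
            raise ValueError(
                "stride %d is not a multiple of itemsize %d" % (s, itemsize))
    return tuple(s // itemsize for s in byte_strides)


# A C-contiguous (4, 3) array of 4-byte floats has byte strides (12, 4).
# The kernel would receive the element strides computed here.
kernel_strides = element_strides((12, 4), itemsize=4)
```

Storage keeps the byte strides, and the kernel only ever sees the converted 
values, so both conventions can coexist.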

> - should strides and dimensions be stored in host memory, device
> memory, or both (how/when should they be synchronized?)

Host memory seems to be the right place: kernel parameters, which originate 
there, are the only easy way to get a value to every thread without 
incurring a global memory access penalty.

> As the data structure gets more complicated, the kernels become more
> complex too.   My experience is that all kernels have to have a
> "general" version that is pretty slow, and progressively, more and
> more special cases get optimized.

I find it helpful to do things the other way around. Solve a rather special 
case first, then generalize. Even incremental solutions are valuable.

> Kernel code generators get bloated.

Deciding on the right complexity for the generators is definitely an issue. 
Rome wasn't built in a day. Going about this incrementally and not rushing it 
seems like a wise idea. You're not on your own.

> How many kinds of kernels are there in PyCUDA right now?  (Given that
> the same code-generator can produce many elementwise kernels, I mean
> to count that as one *kind* of kernel.)  How many things would break
> if arrays were strided?

Two. Elementwise kernels and reduction kernels are the kinds currently 
implemented. All of the GPUArray functionality is written in terms of these 
two.

Andreas
