On 22 May 2012 07:48, Robert Bradshaw <rober...@gmail.com> wrote:
> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
> <d.s.seljeb...@astro.uio.no> wrote:
>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>
>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>> <d.s.seljeb...@astro.uio.no> wrote:
>>>>
>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>
>>>>> Hey,
>>>>>
>>>>> For my gsoc we already have some simple initial ideas, i.e.
>>>>> elementwise vector expressions (a + b with a and b arrays of
>>>>> arbitrary rank); I don't think these need any discussion. However,
>>>>> there are a lot of things that haven't been formally discussed on the
>>>>> mailing list, so here goes.
>>>>>
>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>> mailing list, and I think your insights as a Theano developer can be
>>>>> very helpful in this discussion.
>>>>>
>>>>> User Interface
>>>>> ===========
>>>>> Besides simple array expressions for dense arrays, I would like a
>>>>> mechanism for "custom ufuncs", although to a different extent than
>>>>> what Numpy or Numba provide. There are several ways in which we could
>>>>> want them, e.g. as typed (cdef or external C) functions, as lambdas
>>>>> or Python functions in the same module, or as general objects
>>>>> (e.g. functions Cython doesn't know about).
>>>>> To achieve maximum efficiency it will likely be good to allow sharing
>>>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>>>> would prefer annotated def functions where the parameters are
>>>>> specialized on demand, e.g.
>>>>>
>>>>> @elemental
>>>>> def add(a, b):
>>>>>     # elemental functions can have any number of arguments
>>>>>     # and operate on any compatible dtype
>>>>>     return a + b
>>>>>
>>>>> When calling cdef functions or elemental functions with memoryview
>>>>> arguments, the call performs a (broadcasted) elementwise
>>>>> operation. Alternatively, we can have a parallel.elementwise function
>>>>> which maps the function elementwise, which would also work for object
>>>>> callables. I prefer the former, since I think it will read much
>>>>> more easily.
>>>>>
>>>>> Secondly, we can have a reduce function (and maybe a scan function)
>>>>> that reduces (respectively scans) along a specified axis or set of
>>>>> axes, e.g.
>>>>>
>>>>>     parallel.reduce(add, a, b, axis=(0, 2))
>>>>>
>>>>> where the default for axis is "all axes". As for the default value,
>>>>> this could perhaps be optionally provided to the elemental decorator.
>>>>> Otherwise, the reducer will have to get the default values from each
>>>>> dimension that is reduced in, and then skip those values when
>>>>> reducing. (Of course, the reducer function must be associative and
>>>>> commutative.) Also, a lambda could be passed in instead of an
>>>>
>>>> Only associative, right?
>>>>
>>>> Sounds good to me.
>>>>
>>>>> elementwise or typed cdef function.
>>>>>
>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>> function, which would iterate over N memoryviews and provide a
>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>>>>> should provide only the indices, or also the values. E.g. an in-place
>>>>> elementwise add would read as follows:
>>>>>
>>>>>     for i, j, k in parallel.nditerate(A, B):
>>>>>         A[i, j, k] += B[i, j, k]
>>>>
>>>> I think this sounds good; I guess I don't see a particular reason for
>>>> "ndenumerate", I think code like the above is clearer.
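Just to pin down what the elemental/broadcasting semantics amount to
(the @elemental decorator and the parallel module above are only proposed
names, nothing of this exists yet), a slow pure-Python stand-in that behaves
the way the proposal describes could look like this -- the real thing would
of course generate a typed, compiled loop instead of calling back into Python
per element:

    import numpy as np

    def elemental(func):
        # Stand-in for the proposed decorator: apply 'func' elementwise,
        # broadcasting its array arguments against each other.
        def wrapper(*arrays):
            return np.vectorize(func)(*arrays)
        return wrapper

    @elemental
    def add(a, b):
        return a + b

    x = np.arange(6.0).reshape(2, 3)
    y = np.arange(3.0)        # broadcasts against x
    print(add(x, y))          # same result as x + y
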
>>>
>>> I'm assuming the index computations would not be re-done in this case
>>> (i.e. there's more magic going on here than it looks like at first
>>> glance)? Otherwise there is an advantage to ndenumerate.
>>
>> Ideally, there is a lot more magic going on, though I don't know how far
>> Mark wants to go.
>>
>> Imagine "nditerate(A, A.T)": in that case it would have to make many small
>> tiles so that for each tile being processed, A has a tile in cache and A.T
>> has another tile in cache (so that one doesn't waste cache line transfers).
>>
>> So those array lookups would potentially look up in different memory
>> buffers, with the strides known at compile time.
>
> Yes, being clever about the order in which to iterate over the indices
> is the hard problem to solve here. I was thinking more in terms of the
> inner loop iterating over the innermost dimension only to do the
> indexing (retrieval and assignment), similar to how the generic NumPy
> iterator works.
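To make the tiling point concrete, here is a toy pure-Python sketch of the
kind of blocked traversal nditerate would have to choose for a pair like
(A, A.T). Whether and how the real implementation tiles is exactly what is
being discussed; the tile size and the use of a separate array B in place of
A.T (to keep the example free of aliasing) are made up for illustration:

    import numpy as np

    def tiled_transposed_add(A, B, tile=64):
        # Computes A[i, j] += B[j, i] block by block, so that both the
        # row-major reads of A and the column-major (transposed) reads
        # of B come from small tiles that fit in cache.
        n, m = A.shape
        for i0 in range(0, n, tile):
            for j0 in range(0, m, tile):
                A[i0:i0 + tile, j0:j0 + tile] += B[j0:j0 + tile, i0:i0 + tile].T

    A = np.zeros((500, 300))
    B = np.random.rand(300, 500)
    tiled_transposed_add(A, B)
    assert np.allclose(A, B.T)

A naive row-by-row loop would step through B with a stride of 500 doubles,
touching a new cache line for every element read; the blocked version touches
each cache line of B only once per tile.
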
That's a valid point, but my experience has been that any worthy C compiler
will do common subexpression elimination for the outer dimensions and not
recompute the offset every time. In fact, it generated marginally faster code
for scalar assignment than a "cascaded pointer assignment", i.e. faster than

    p0 = data;
    for (...) {
        p1 = p0 + i * strides[0];
        for (...) {
            p2 = p1 + j * strides[1];
            ...
        }
    }

(I haven't tried manual strength reduction there, though.)

>> Which begs the question: What about this body?
>>
>>     if i < 100:
>>         continue
>>     else:
>>         A[i, j, k] += B[i - 100, j, k]
>>
>> I guess just fall back to a non-tiled version? One could of course do some
>> shifting of which tiles of B to grab etc., but there's a limit to how smart
>> one should try to be; one could emit a warning and say that one should
>> slice and dice the memoryviews into shape before they are passed to
>> nditerate.
>
> Linear transformations of the index variables could probably be
> handled, but that's certainly not v1 (and not too difficult for the
> user to express manually).
>
> - Robert
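For the record, the "slice and dice the memoryviews into shape" route for the
shifted example above is indeed short to express manually. In plain NumPy
terms (parallel.nditerate itself still being hypothetical; with the proposed
API the same pre-sliced views would simply be passed to it with a plain body):

    import numpy as np

    A = np.random.rand(300, 4, 5)
    B = np.random.rand(300, 4, 5)

    # Instead of branching on i inside the body
    # (if i < 100: continue, else: A[i, j, k] += B[i - 100, j, k]),
    # pre-shift the operands so the indices line up one-to-one.
    A_shift = A[100:]
    B_shift = B[:A.shape[0] - 100]
    A_shift += B_shift    # updates A in place through the view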