On Tue, May 22, 2012 at 6:16 AM, mark florisson <markflorisso...@gmail.com> wrote:
> On 22 May 2012 07:48, Robert Bradshaw <rober...@gmail.com> wrote:
>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
>> <d.s.seljeb...@astro.uio.no> wrote:
>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>>> <d.s.seljeb...@astro.uio.no> wrote:
>>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> For my GSoC we already have some simple initial ideas, i.e.
>>>>>> elementwise vector expressions (a + b with a and b arrays of
>>>>>> arbitrary rank); I don't think these need any discussion. However,
>>>>>> there are a lot of things that haven't been formally discussed on
>>>>>> the mailing list, so here goes.
>>>>>>
>>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>>> mailing list, and I think your insights as a Theano developer can
>>>>>> be very helpful in this discussion.
>>>>>>
>>>>>> User Interface
>>>>>> ===========
>>>>>> Besides simple array expressions for dense arrays, I would like a
>>>>>> mechanism for "custom ufuncs", although to a different extent than
>>>>>> what NumPy or Numba provide. There are several forms in which we
>>>>>> could want them, e.g. as typed (cdef, or external C) functions, as
>>>>>> lambdas or Python functions in the same module, or as general
>>>>>> objects (e.g. functions Cython doesn't know about).
>>>>>> To achieve maximum efficiency it will likely be good to allow
>>>>>> sharing these functions in .pxd files. We have 'cdef inline'
>>>>>> functions, but I would prefer annotated def functions where the
>>>>>> parameters are specialized on demand, e.g.
>>>>>>
>>>>>>     @elemental
>>>>>>     def add(a, b):
>>>>>>         # elemental functions can have any number of arguments
>>>>>>         # and operate on any compatible dtype
>>>>>>         return a + b
>>>>>>
>>>>>> When calling cdef functions or elemental functions with memoryview
>>>>>> arguments, the call performs a (broadcasted) elementwise
>>>>>> operation. Alternatively, we can have a parallel.elementwise
>>>>>> function which maps the function elementwise, which would also
>>>>>> work for object callables. I prefer the former, since I think it
>>>>>> reads much more easily.
>>>>>>
>>>>>> Secondly, we can have a reduce function (and maybe a scan
>>>>>> function) that reduces (respectively scans) along a specified axis
>>>>>> or set of axes, e.g.
>>>>>>
>>>>>>     parallel.reduce(add, a, b, axis=(0, 2))
>>>>>>
>>>>>> where the default for axis is "all axes". As for the default
>>>>>> (initial) value, this could perhaps optionally be provided to the
>>>>>> elemental decorator. Otherwise, the reducer will have to take its
>>>>>> initial values from each dimension being reduced over, and then
>>>>>> skip those values when reducing. (Of course, the reducer function
>>>>>> must be associative and commutative.) Also, a lambda could be
>>>>>> passed in instead of an
>>>>>
>>>>> Only associative, right?
>>>>>
>>>>> Sounds good to me.
>>>>>
>>>>>> elementwise or typed cdef function.
>>>>>>
>>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>>> function, which would iterate over N memoryviews and provide a
>>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure
>>>>>> whether it should provide only the indices, or also the values.
>>>>>> E.g. an inplace elementwise add would read as follows:
>>>>>>
>>>>>>     for i, j, k in parallel.nditerate(A, B):
>>>>>>         A[i, j, k] += B[i, j, k]
>>>>>
>>>>> I think this sounds good; I guess I don't see a particular reason
>>>>> for "ndenumerate", I think code like the above is clearer.
>>>>
>>>> I'm assuming the index computations would not be re-done in this
>>>> case (i.e. there's more magic going on here than it appears at
>>>> first glance)? Otherwise there is an advantage to ndenumerate.
>>>
>>> Ideally, there is a lot more magic going on, though I don't know how
>>> far Mark wants to go.
>>>
>>> Imagine "nditerate(A, A.T)": in that case it would have to make many
>>> small tiles so that for each tile being processed, A has a tile in
>>> cache and A.T has another tile in cache (so that one doesn't waste
>>> cache line transfers).
>>>
>>> So those array lookups would potentially look up in different memory
>>> buffers, with the strides known at compile time.
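To make the tiling Dag describes concrete, here is a minimal C sketch. It is purely illustrative, not actual Cython codegen: the function name, the fixed tile size, and the restriction to a square row-major array are all assumptions of the example.

    #include <stddef.h>

    #define TILE 64  /* tile edge; chosen so two TILE x TILE tiles of
                        doubles fit comfortably in cache */

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* out[i][j] = a[i][j] + a[j][i] for an n x n row-major array,
       i.e. an elementwise add of A and A.T. */
    void add_transpose_tiled(double *out, const double *a, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t imax = min_sz(ii + TILE, n);
                size_t jmax = min_sz(jj + TILE, n);
                /* Within one tile, a[i*n + j] walks a tile of A and
                   a[j*n + i] walks the corresponding tile of A.T;
                   each stays resident in cache for the whole tile. */
                for (size_t i = ii; i < imax; i++)
                    for (size_t j = jj; j < jmax; j++)
                        out[i*n + j] = a[i*n + j] + a[j*n + i];
            }
        }
    }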
>> Yes, being clever about the order in which to iterate over the
>> indices is the hard problem to solve here. I was thinking more in
>> terms of the inner loop iterating over the innermost dimension only
>> to do the indexing (retrieval and assignment), similar to how the
>> generic NumPy iterator works.
>
> That's a valid point, but my experience has been that any worthy C
> compiler will do common subexpression elimination for the outer
> dimensions and not recompute the offset every time. It actually
> generated marginally faster code for scalar assignment than a
> "cascaded pointer assignment" does, i.e. faster than
>
>     p0 = data;
>     for (...) {
>         p1 = p0 + i * strides[0];
>         for (...) {
>             p2 = p1 + j * strides[1];
>             ...
>         }
>     }
>
> (I haven't tried manual strength reduction there, though.)
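For reference, a self-contained sketch of the recompute-the-offset form Mark is comparing against (the function name and the fixed 3-D rank are hypothetical). A compiler doing common subexpression elimination can hoist i*as[0] out of the j loop and i*as[0] + j*as[1] out of the k loop, leaving only the k term to vary in the inner loop.

    #include <stddef.h>

    /* In-place a += b over two strided 3-D double arrays; strides are
       given in bytes, as for Cython memoryviews. */
    void add_inplace_3d(char *a, const char *b, const ptrdiff_t shape[3],
                        const ptrdiff_t as[3], const ptrdiff_t bs[3])
    {
        for (ptrdiff_t i = 0; i < shape[0]; i++)
            for (ptrdiff_t j = 0; j < shape[1]; j++)
                for (ptrdiff_t k = 0; k < shape[2]; k++)
                    *(double *)(a + i*as[0] + j*as[1] + k*as[2]) +=
                        *(const double *)(b + i*bs[0] + j*bs[1] + k*bs[2]);
    }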
That's a good point, though "for (p2 = p1; p2 < precomputed; p2 += stride1) { ... }" is probably a better manual reduction. I concede that compilers are really smart about this kind of stuff these days, though (even if they might not be able to infer that, for example, strides doesn't change).

- Robert
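Written out as a complete function, the manual strength reduction Robert sketches might look as follows. This is a sketch only, assuming positive strides; the names and the 2-D case are invented for the example. Each loop advances a pointer by the stride and compares it against a precomputed end, instead of multiplying an index by the stride on every iteration.

    #include <stddef.h>

    /* Multiply every element of a strided 2-D double array by factor.
       Strides are in bytes and assumed positive. */
    void scale_2d(char *data, const ptrdiff_t shape[2],
                  const ptrdiff_t strides[2], double factor)
    {
        char *end0 = data + shape[0] * strides[0];      /* precomputed */
        for (char *p1 = data; p1 < end0; p1 += strides[0]) {
            char *end1 = p1 + shape[1] * strides[1];    /* precomputed */
            for (char *p2 = p1; p2 < end1; p2 += strides[1])
                *(double *)p2 *= factor;
        }
    }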