On Mon, May 21, 2012 at 11:57 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
> On 05/22/2012 08:48 AM, Robert Bradshaw wrote:
>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
>> <d.s.seljeb...@astro.uio.no> wrote:
>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>>> <d.s.seljeb...@astro.uio.no> wrote:
>>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>> Hey,
>>>>>>
>>>>>> For my GSoC we already have some simple initial ideas, i.e.
>>>>>> elementwise vector expressions (a + b with a and b arrays of
>>>>>> arbitrary rank); I don't think these need any discussion. However,
>>>>>> there are a lot of things that haven't been formally discussed on
>>>>>> the mailing list, so here goes.
>>>>>>
>>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>>> mailing list, and I think your insights as a Theano developer can
>>>>>> be very helpful in this discussion.
>>>>>>
>>>>>> User Interface
>>>>>> ==============
>>>>>> Besides simple array expressions for dense arrays I would like a
>>>>>> mechanism for "custom ufuncs", although to a different extent than
>>>>>> what NumPy or Numba provide. There are several ways in which we
>>>>>> could want them, e.g. as typed (cdef or external C) functions, as
>>>>>> lambdas or Python functions in the same module, or as general
>>>>>> objects (e.g. functions Cython doesn't know about).
>>>>>> To achieve maximum efficiency it will likely be good to allow
>>>>>> sharing these functions in .pxd files. We have 'cdef inline'
>>>>>> functions, but I would prefer annotated def functions where the
>>>>>> parameters are specialized on demand, e.g.
>>>>>>
>>>>>> @elemental
>>>>>> def add(a, b):
>>>>>>     # elemental functions can have any number of arguments
>>>>>>     # and operate on any compatible dtype
>>>>>>     return a + b
>>>>>>
>>>>>> When calling cdef functions or elemental functions with memoryview
>>>>>> arguments, the arguments perform a (broadcasted) elementwise
>>>>>> operation. Alternatively, we can have a parallel.elementwise
>>>>>> function which maps the function elementwise, which would also
>>>>>> work for object callables. I prefer the former, since I think it
>>>>>> will read much more easily.
>>>>>>
>>>>>> Secondly, we can have a reduce function (and maybe a scan
>>>>>> function) that reduces (respectively scans) over a specified axis
>>>>>> or set of axes, e.g.
>>>>>>
>>>>>> parallel.reduce(add, a, b, axis=(0, 2))
>>>>>>
>>>>>> where the default for axis is "all axes". As for the default
>>>>>> value, this could perhaps be provided optionally to the elemental
>>>>>> decorator. Otherwise, the reducer will have to get the default
>>>>>> values from each dimension that is reduced in, and then skip those
>>>>>> values when reducing. (Of course, the reducer function must be
>>>>>> associative and commutative.) Also, a lambda could be passed in
>>>>>> instead of an
>>>>>
>>>>> Only associative, right?
>>>>>
>>>>> Sounds good to me.
>>>>>
>>>>>> elementwise or typed cdef function.
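For concreteness, the semantics proposed above can be modeled in plain
Python with NumPy. This is only a sketch of one plausible reading:
"elemental" and "reduce" below are mock-ups of the proposed API, not
anything that exists in Cython, and np.vectorize merely stands in for
the real specialization machinery.

    import functools
    import numpy as np

    def elemental(func):
        # Model of the proposed @elemental decorator: apply the scalar
        # function elementwise with NumPy-style broadcasting.
        @functools.wraps(func)
        def wrapper(*args):
            return np.vectorize(func)(*args)
        return wrapper

    @elemental
    def add(a, b):
        return a + b

    def reduce(func, *operands, axis=None):
        # One reading of parallel.reduce(add, a, b, axis=(0, 2)):
        # combine the operands elementwise, then fold the result along
        # each requested axis with the same (associative) function.
        result = functools.reduce(func, operands)
        axes = range(result.ndim) if axis is None else axis
        # Reduce the highest axis first so that the remaining axis
        # numbers stay valid after each reduction.
        for ax in sorted(axes, reverse=True):
            result = functools.reduce(func, np.moveaxis(result, ax, 0))
        return result

    a = np.arange(24.0).reshape(2, 3, 4)
    b = np.ones((2, 3, 4))
    assert np.allclose(reduce(add, a, b, axis=(0, 2)),
                       (a + b).sum(axis=(0, 2)))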
>>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>>> function, which would iterate over N memoryviews and provide a
>>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure
>>>>>> if it should provide only the indices, or also the values. E.g.
>>>>>> an in-place elementwise add would read as follows:
>>>>>>
>>>>>> for i, j, k in parallel.nditerate(A, B):
>>>>>>     A[i, j, k] += B[i, j, k]
>>>>>
>>>>> I think this sounds good; I guess I don't see a particular reason
>>>>> for "ndenumerate", I think code like the above is clearer.
>>>>
>>>> I'm assuming the index computations would not be re-done in this
>>>> case (i.e. there's more magic going on here than it looks at first
>>>> glance)? Otherwise there is an advantage to ndenumerate.
>>>
>>> Ideally, there is a lot more magic going on, though I don't know how
>>> far Mark wants to go.
>>>
>>> Imagine "nditerate(A, A.T)": in that case it would have to make many
>>> small tiles so that for each tile being processed, A has a tile in
>>> cache and A.T has another tile in cache (so that one doesn't waste
>>> cache line transfers).
>>>
>>> So those array lookups would potentially look up in different memory
>>> buffers, with the strides known at compile time.
>>
>> Yes, being clever about the order in which to iterate over the
>> indices is the hard problem to solve here. I was thinking more in
>> terms of the inner loop iterating over the innermost dimension only
>> to do the indexing (retrieval and assignment), similar to how the
>> generic NumPy iterator works.
>
> The point isn't only being clever about the *order*... you need
> "copy-in, copy-out".
>
> The point is that the NumPy iterator is not good enough (for
> out-of-cache situations). Since you grab a cache line (64 bytes) each
> time from main memory, a plain NumPy broadcasted iterator throws away
> a lot of memory bandwidth for "A + A.T": for ndim > 1 there is NO
> iteration order which isn't bad (for instance, you could iterate in
> the order of A, and the result would be that for each element of A.T
> you fetch, a full 64 bytes are transferred).
>
> So the solution is to copy A.T block-wise to a temporary scratch space
> in cache, so that you use all the elements in the cache line before
> throwing it out of cache.
>
> In C, I've seen a simple blocking transpose operation be over four
> times faster than the brute-force transpose for this reason.
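The copy-in, copy-out idea is easy to sketch. Below is the blocked
transpose access pattern in NumPy form; the tile size bs is a made-up
tuning parameter, and the speedup Dag describes only materializes in
compiled code, so this sketch just illustrates the access pattern:

    import numpy as np

    def blocked_transpose(A, bs=64):
        # Process A in bs-by-bs tiles. Each tile assignment touches a
        # small working set of both arrays, so every cache line fetched
        # from A is fully used before it is evicted; a brute-force
        # out[j, i] = A[i, j] loop has no such locality for large A.
        n, m = A.shape
        out = np.empty((m, n), dtype=A.dtype)
        for i in range(0, n, bs):
            for j in range(0, m, bs):
                # NumPy slices clip at the edges, so ragged border
                # tiles are handled automatically.
                out[j:j + bs, i:i + bs] = A[i:i + bs, j:j + bs].T
        return out

    A = np.arange(10**6, dtype=np.float64).reshape(1000, 1000)
    assert (blocked_transpose(A) == A.T).all()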
Yes, I understand this. Truly element-wise arithmetic on arrays with
the same memory layout (or even the same size) is not that uncommon,
though, and should be optimized for as well.

Fortunately, I feel pretty comfortable sitting back and watching,
'cause you've both thought about these issues far more than I have,
and I don't see you both getting it wrong :).

- Robert
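A sketch of the same-layout fast path Robert mentions, assuming
NumPy-style flags and strides; the can_flatten helper is hypothetical,
and the general branch merely stands in for the blocked nditerate
machinery discussed above:

    import numpy as np

    def can_flatten(a, b):
        # Identical shape and strides on two C-contiguous buffers mean
        # an elementwise op can run as one flat, unit-stride loop with
        # no per-element index arithmetic.
        return (a.shape == b.shape and a.strides == b.strides
                and a.flags['C_CONTIGUOUS'] and b.flags['C_CONTIGUOUS'])

    def inplace_add(A, B):
        if can_flatten(A, B):
            # Fast path: a single flat pass over both buffers.
            a_flat = A.ravel()  # a view, since A is C-contiguous
            a_flat += B.ravel()
        else:
            # General path: stands in for blocked, tiled iteration.
            np.add(A, B, out=A)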