On Mon, May 21, 2012 at 11:57 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
> On 05/22/2012 08:48 AM, Robert Bradshaw wrote:
>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
>> <d.s.seljeb...@astro.uio.no> wrote:
>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>>> <d.s.seljeb...@astro.uio.no> wrote:
>>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>> Hey,
>>>>>>
>>>>>> For my GSoC we already have some simple initial ideas, i.e.
>>>>>> elementwise vector expressions (a + b with a and b arrays of
>>>>>> arbitrary rank); I don't think these need any discussion. However,
>>>>>> there are a lot of things that haven't been formally discussed on
>>>>>> the mailing list, so here goes.
>>>>>>
>>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>>> mailing list, and I think your insights as a Theano developer can
>>>>>> be very helpful in this discussion.
>>>>>>
>>>>>> User Interface
>>>>>> ==============
>>>>>> Besides simple array expressions for dense arrays I would like a
>>>>>> mechanism for "custom ufuncs", although to a different extent than
>>>>>> what NumPy or Numba provide. There are several ways in which we
>>>>>> could want them, e.g. as typed (cdef or external C) functions, as
>>>>>> lambdas or Python functions in the same module, or as general
>>>>>> objects (e.g. functions Cython doesn't know about).
>>>>>> To achieve maximum efficiency it will likely be good to allow
>>>>>> sharing these functions in .pxd files. We have 'cdef inline'
>>>>>> functions, but I would prefer annotated def functions where the
>>>>>> parameters are specialized on demand, e.g.
>>>>>>
>>>>>> @elemental
>>>>>> def add(a, b):
>>>>>>     # elemental functions can have any number of arguments
>>>>>>     # and operate on any compatible dtype
>>>>>>     return a + b
>>>>>>
>>>>>> When calling cdef functions or elemental functions with memoryview
>>>>>> arguments, the arguments perform a (broadcasted) elementwise
>>>>>> operation. Alternatively, we can have a parallel.elementwise
>>>>>> function which maps the function elementwise, which would also
>>>>>> work for object callables. I prefer the former, since I think it
>>>>>> will read much more easily.
>>>>>>
>>>>>> Secondly, we can have a reduce function (and maybe a scan
>>>>>> function) that reduces (respectively scans) over a specified axis
>>>>>> or set of axes, e.g.
>>>>>>
>>>>>> parallel.reduce(add, a, b, axis=(0, 2))
>>>>>>
>>>>>> where the default for axis is "all axes". As for the default
>>>>>> value, this could perhaps be provided optionally to the elemental
>>>>>> decorator. Otherwise, the reducer will have to get the default
>>>>>> values from each dimension that is reduced in, and then skip those
>>>>>> values when reducing. (Of course, the reducer function must be
>>>>>> associative and commutative.) Also, a lambda could be passed in
>>>>>> instead of an
>>>>>
>>>>> Only associative, right?
>>>>>
>>>>> Sounds good to me.
>>>>>
>>>>>> elementwise or typed cdef function.
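For concreteness, the semantics proposed above can be modeled in plain
Python with NumPy. This is only a sketch of one plausible reading:
"elemental" and "reduce" below are mock-ups of the proposed API, not
anything that exists in Cython, and np.vectorize merely stands in for
the real specialization machinery.

    import functools
    import numpy as np

    def elemental(func):
        # Model of the proposed @elemental decorator: apply the scalar
        # function elementwise with NumPy-style broadcasting.
        @functools.wraps(func)
        def wrapper(*args):
            return np.vectorize(func)(*args)
        return wrapper

    @elemental
    def add(a, b):
        return a + b

    def reduce(func, *operands, axis=None):
        # One reading of parallel.reduce(add, a, b, axis=(0, 2)):
        # combine the operands elementwise, then fold the result along
        # each requested axis with the same (associative) function.
        result = functools.reduce(func, operands)
        axes = range(result.ndim) if axis is None else axis
        # Reduce the highest axis first so that the remaining axis
        # numbers stay valid after each reduction.
        for ax in sorted(axes, reverse=True):
            result = functools.reduce(func, np.moveaxis(result, ax, 0))
        return result

    a = np.arange(24.0).reshape(2, 3, 4)
    b = np.ones((2, 3, 4))
    assert np.allclose(reduce(add, a, b, axis=(0, 2)),
                       (a + b).sum(axis=(0, 2)))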
>>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>>> function, which would iterate over N memoryviews and provide a
>>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure
>>>>>> if it should provide only the indices, or also the values. E.g.
>>>>>> an in-place elementwise add would read as follows:
>>>>>>
>>>>>> for i, j, k in parallel.nditerate(A, B):
>>>>>>     A[i, j, k] += B[i, j, k]
>>>>>
>>>>> I think this sounds good; I guess I don't see a particular reason
>>>>> for "ndenumerate", I think code like the above is clearer.
>>>>
>>>> I'm assuming the index computations would not be re-done in this
>>>> case (i.e. there's more magic going on here than it looks at first
>>>> glance)? Otherwise there is an advantage to ndenumerate.
>>>
>>> Ideally, there is a lot more magic going on, though I don't know how
>>> far Mark wants to go.
>>>
>>> Imagine "nditerate(A, A.T)": in that case it would have to make many
>>> small tiles so that for each tile being processed, A has a tile in
>>> cache and A.T has another tile in cache (so that one doesn't waste
>>> cache line transfers).
>>>
>>> So those array lookups would potentially look up in different memory
>>> buffers, with the strides known at compile time.
>>
>> Yes, being clever about the order in which to iterate over the
>> indices is the hard problem to solve here. I was thinking more in
>> terms of the inner loop iterating over the innermost dimension only
>> to do the indexing (retrieval and assignment), similar to how the
>> generic NumPy iterator works.
>
> The point isn't only being clever about the *order*... you need
> "copy-in, copy-out".
>
> The point is that the NumPy iterator is not good enough (for
> out-of-cache situations). Since you grab a cache line (64 bytes) each
> time from main memory, a plain NumPy broadcasted iterator throws away
> a lot of memory bandwidth for "A + A.T": for ndim > 1 there is NO
> iteration order which isn't bad (for instance, you could iterate in
> the order of A, and the result would be that for each element of A.T
> you fetch, a full 64 bytes are transferred).
>
> So the solution is to copy A.T block-wise to a temporary scratch space
> in cache, so that you use all the elements in the cache line before
> throwing it out of cache.
>
> In C, I've seen a simple blocking transpose operation be over four
> times faster than the brute-force transpose for this reason.
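The copy-in, copy-out idea is easy to sketch. Below is the blocked
transpose access pattern in NumPy form; the tile size bs is a made-up
tuning parameter, and the speedup Dag describes only materializes in
compiled code, so this sketch just illustrates the access pattern:

    import numpy as np

    def blocked_transpose(A, bs=64):
        # Process A in bs-by-bs tiles. Each tile assignment touches a
        # small working set of both arrays, so every cache line fetched
        # from A is fully used before it is evicted; a brute-force
        # out[j, i] = A[i, j] loop has no such locality for large A.
        n, m = A.shape
        out = np.empty((m, n), dtype=A.dtype)
        for i in range(0, n, bs):
            for j in range(0, m, bs):
                # NumPy slices clip at the edges, so ragged border
                # tiles are handled automatically.
                out[j:j + bs, i:i + bs] = A[i:i + bs, j:j + bs].T
        return out

    A = np.arange(10**6, dtype=np.float64).reshape(1000, 1000)
    assert (blocked_transpose(A) == A.T).all()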
Yes, I understand this. Truly element-wise arithmetic on arrays with
the same memory layout (or even the same size) is not that uncommon,
though, and should be optimized for as well.

Fortunately, I feel pretty comfortable sitting back and watching,
'cause you've both thought about these issues far more than I have,
and I don't see you both getting it wrong :).

- Robert
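A sketch of the same-layout fast path Robert mentions, assuming
NumPy-style flags and strides; the can_flatten helper is hypothetical,
and the general branch merely stands in for the blocked nditerate
machinery discussed above:

    import numpy as np

    def can_flatten(a, b):
        # Identical shape and strides on two C-contiguous buffers mean
        # an elementwise op can run as one flat, unit-stride loop with
        # no per-element index arithmetic.
        return (a.shape == b.shape and a.strides == b.strides
                and a.flags['C_CONTIGUOUS'] and b.flags['C_CONTIGUOUS'])

    def inplace_add(A, B):
        if can_flatten(A, B):
            # Fast path: a single flat pass over both buffers.
            a_flat = A.ravel()  # a view, since A is C-contiguous
            a_flat += B.ravel()
        else:
            # General path: stands in for blocked, tiled iteration.
            np.add(A, B, out=A)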