On 05/21/2012 12:56 PM, mark florisson wrote:
On 21 May 2012 11:34, Dag Sverre Seljebotn<d.s.seljeb...@astro.uio.no> wrote:
On 05/20/2012 04:03 PM, mark florisson wrote:
Hey,
For my GSoC we already have some simple initial ideas, i.e.
elementwise vector expressions (a + b with a and b arrays of
arbitrary rank); I don't think these need any discussion. However,
there are a lot of things that haven't been formally discussed on the
mailing list, so here goes.
Frédéric, I am CCing you since you expressed interest on the numpy
mailing list, and I think your insights as a Theano developer can be
very helpful in this discussion.
User Interface
==============
Besides simple array expressions for dense arrays I would like a
mechanism for "custom ufuncs", although to a different extent than
what Numpy or Numba provide. There are several ways in which we could
want them, e.g. as typed (cdef or external C) functions, as
lambdas or Python functions in the same module, or as general objects
(e.g. functions Cython doesn't know about).
To achieve maximum efficiency it will likely be good to allow sharing
these functions in .pxd files. We have 'cdef inline' functions, but I
would prefer annotated def functions where the parameters are
specialized on demand, e.g.
@elemental
def add(a, b):
    # elemental functions can have any number of arguments
    # and operate on any compatible dtype
    return a + b
When calling cdef functions or elemental functions with memoryview
arguments, the call performs a (broadcast) elementwise
operation. Alternatively, we can have a parallel.elementwise function
which maps the function elementwise, which would also work for object
callables. I prefer the former, since I think it will read much
easier.
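As a rough illustration, the intended broadcasting semantics can be
sketched in plain Python/NumPy (the `elemental` decorator below is a
hypothetical stand-in written for this sketch; `np.vectorize` plays the
role of the specialized loop Cython would generate):

```python
import numpy as np

def elemental(func):
    """Hypothetical stand-in for the proposed @elemental decorator:
    map a scalar function over broadcast-compatible array arguments."""
    def wrapper(*args):
        # np.vectorize broadcasts arbitrary-rank inputs elementwise
        return np.vectorize(func)(*args)
    return wrapper

@elemental
def add(a, b):
    return a + b

a = np.arange(6).reshape(2, 3)  # shape (2, 3)
b = np.array([10, 20, 30])      # shape (3,), broadcasts against (2, 3)
print(add(a, b))                # same result as a + b
```

The real feature would of course compile a specialized loop rather than
dispatch through Python per element.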
Secondly, we can have a reduce function (and maybe a scan function)
that reduces (respectively scans) along a specified axis or set of
axes, e.g.
parallel.reduce(add, a, b, axis=(0, 2))
where the default for axis is "all axes". As for the default value,
this could perhaps optionally be provided to the elemental decorator.
Otherwise, the reducer will have to take the default values from each
dimension that is reduced over, and then skip those values when
reducing. (Of course, the reducer function must be associative and
commutative.) Also, a lambda could be passed in instead of an
elementwise or typed cdef function.

Only associative, right?

Sounds good to me.

Ah, I guess, because we can reduce thread-local results manually in a
specified (elementwise) order (I was thinking of generating
OpenMP-annotated loops that can be enabled/disabled at the C level,
with an 'if' clause and a sensible lower bound on the number of
iterations required).
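For a single array argument, the reduce semantics could look roughly
like this pure-Python sketch (the name reduce_axes and its
implementation are illustrative only, not the proposed API; it assumes
an associative binary function):

```python
import numpy as np

def reduce_axes(func, arr, axis=None):
    """Hypothetical sketch of parallel.reduce: fold an associative
    binary function along one or more axes (default: all axes)."""
    if axis is None:
        axis = tuple(range(arr.ndim))
    if isinstance(axis, int):
        axis = (axis,)
    # Reduce the highest-numbered axes first so the remaining
    # axis numbers stay valid after each reduction.
    for ax in sorted(axis, reverse=True):
        view = np.moveaxis(arr, ax, 0)
        acc = view[0]
        for chunk in view[1:]:
            acc = func(acc, chunk)
        arr = acc
    return arr

a = np.arange(24).reshape(2, 3, 4)
print(reduce_axes(lambda x, y: x + y, a, axis=(0, 2)))  # matches a.sum(axis=(0, 2))
```

Because this only left-folds in a fixed order, associativity alone is
enough; commutativity is not required.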
Finally, we would have a parallel.nditer/ndenumerate/nditerate
function, which would iterate over N memoryviews and provide a
sensible memory access pattern (like numpy.nditer). I'm not sure
whether it should provide only the indices or also the values. E.g. an
in-place elementwise add would read as follows:
for i, j, k in parallel.nditerate(A, B):
    A[i, j, k] += B[i, j, k]
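A minimal pure-Python model of what nditerate could mean (np.ndindex
stands in for the index generation here; the real version would pick a
cache-friendly, possibly out-of-order access pattern and could run
iterations in parallel):

```python
import numpy as np

def nditerate(*arrays):
    """Hypothetical sketch: yield index tuples covering the common
    shape of the given arrays, in C order."""
    shape = arrays[0].shape
    assert all(arr.shape == shape for arr in arrays)
    return np.ndindex(shape)

A = np.ones((2, 2, 2))
B = np.full((2, 2, 2), 2.0)
for i, j, k in nditerate(A, B):
    A[i, j, k] += B[i, j, k]
print(A)  # every element is now 3.0
```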
I think this sounds good; I guess I don't see a particular reason for
"ndenumerate", since I think code like the above is clearer.
It's perhaps worth at least thinking about how to support "for idx in ...",
"A[idx[2], Ellipsis] = ...", i.e. an arbitrary number of dimensions. Not in
the first iteration, though.
Yeah, definitely.
Putting it in "parallel" is nice because prange already has out-of-order
semantics... But of course, there are performance benefits even within a
single thread because of the out-of-order aspect. This should at least be a
big NOTE box in the documentation.
Implementation
==============
Frédéric, feel free to correct me at any point here :)
As for the implementation, I think it will be a good idea to at least
reuse (optionally, through command-line flags) Theano's optimization
pipeline. I think it would be reasonably easy to build a Theano
expression graph (after fusing the expressions in Cython first), run
the Theano optimizations on that, and map the result back to a Cython
AST. Optionally, we could store a pickled graph representation (or
even a compiled Theano function?) and provide it as an optional
specialization at runtime (but mapping back correctly to memoryviews
where needed, etc.). As Numba matures, a Numba runtime specialization
could optionally be provided.
Can you enlighten us a bit about what Theano's optimizations involve? You
mention doing the iteration specializations yourself below, and also the
tiling...
Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and
numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided
by Theano?
Yes, it does those kinds of things, and it also eliminates common
subexpressions and transforms certain expressions into BLAS/LAPACK
calls. I'm not sure we want that specifically. I'm thinking it
might be more fruitful to start off with a Theano-only specialization,
implement low-level code generation in Theano, and use that from
Cython by either directly dumping in the generated code or deferring
that to Theano. At this point I'm not entirely sure.
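To make the "numeric stabilization" point concrete, here is a toy
version of such a rewrite on a nested-tuple expression representation
(this is an illustration only, not Theano's actual graph API; the rule
only matches the literal form log(1 + x)):

```python
import math

# Toy expression "graph": nested tuples such as ('log', ('add', 1, 'x')).

def rewrite(expr):
    """Apply one Theano-style stabilization rule: log(1 + x) -> log1p(x)."""
    if isinstance(expr, tuple):
        expr = tuple(rewrite(e) for e in expr)
        if (expr[0] == 'log' and isinstance(expr[1], tuple)
                and expr[1][:2] == ('add', 1)):
            return ('log1p', expr[1][2])
    return expr

def evaluate(expr, env):
    if isinstance(expr, str):
        return env[expr]          # variable lookup
    if isinstance(expr, (int, float)):
        return expr               # literal constant
    op, *args = expr
    ops = {'log': math.log, 'log1p': math.log1p,
           'add': lambda a, b: a + b}
    return ops[op](*[evaluate(a, env) for a in args])

expr = ('log', ('add', 1, 'x'))
print(evaluate(expr, {'x': 1e-16}))           # 0.0: 1 + 1e-16 rounds to 1.0
print(evaluate(rewrite(expr), {'x': 1e-16}))  # ~1e-16: log1p keeps precision
```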
Still, if this is all Theano provides, I question structuring the
project around reusing Theano. It's the sort of thing that is
nice to have but not fundamental (like memory access patterns).
Put another way, it sounds like Theano could easily be made an optional
dependency currently.
Another question is of course whether it is better to work on Theano to
implement tiling etc. for the CPU (and even compile all the
specializations and select between them).
You could perhaps even have Theano use PEP 3118 rather than NumPy too.
I guess I should subscribe to the Theano list.
Dag
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel