On 21 May 2012 11:34, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote: > On 05/20/2012 04:03 PM, mark florisson wrote: >> >> Hey, >> >> For my gsoc we already have some simple initial ideas, i.e. >> elementwise vector expressions (a + b with a and b arrays with >> arbitrary rank), I don't think these need any discussion. However, >> there are a lot of things that haven't been formally discussed on the >> mailing list, so here goes. >> >> Frédéric, I am CCing you since you expressed interest on the numpy >> mailing list, and I think your insights as a Theano developer can be >> very helpful in this discussion. >> >> User Interface >> =========== >> Besides simple array expressions for dense arrays I would like a >> mechanism for "custom ufuncs", although to a different extent to what >> Numpy or Numba provide. There are several ways in which we could want >> them, e.g. as typed functions (cdef, or external C) functions, as >> lambas or Python functions in the same module, or as general objects >> (e.g. functions Cython doesn't know about). >> To achieve maximum efficiency it will likely be good to allow sharing >> these functions in .pxd files. We have 'cdef inline' functions, but I >> would prefer annotated def functions where the parameters are >> specialized on demand, e.g. >> >> @elemental >> def add(a, b): # elemental functions can have any number of arguments >> and operate on any compatible dtype >> return a + b >> >> When calling cdef functions or elemental functions with memoryview >> arguments, the arguments perform a (broadcasted) elementwise >> operation. Alternatively, we can have a parallel.elementwise function >> which maps the function elementwise, which would also work for object >> callables. I prefer the former, since I think it will read much >> easier. >> >> Secondly, we can have a reduce function (and maybe a scan function), >> that reduce (respectively scan) in a specified axis or number of axes. >> E.g. >> >> parallel.reduce(add, a, b, axis=(0, 2)) >> >> where the default for axis is "all axes". As for the default value, >> this could be perhaps optionally provided to the elemental decorator. >> Otherwise, the reducer will have to get the default values from each >> dimension that is reduced in, and then skip those values when >> reducing. (Of course, the reducer function must be associate and >> commutative). Also, a lambda could be passed in instead of an > > > Only associative, right? > > Sounds good to me. >
Ah, I guess, because we can reduce thead-local results manually in a specified (elementwise) order (I was thinking of generating OpenMP annotated loops, that can be enabled/disabled at the C level, with an 'if' clause with a sensible lower bound of iterations required). >> elementwise or typed cdef function. >> >> Finally, we would have a parallel.nditer/ndenumerate/nditerate >> function, which would iterate over N memoryviews, and provide a >> sensible memory access pattern (like numpy.nditer). I'm not sure if it >> should provide only the indices, or also the values. e.g. an inplace >> elementwise add would read as follows: >> >> for i, j, k in parallel.nditerate(A, B): >> A[i, j, k] += B[i, j, k] > > > > I think this sounds good; I guess don't see a particular reason for > "ndenumerate", I think code like the above is clearer. > > It's perhaps worth at least thinking about how to support "for idx in ...", > "A[idx[2], Ellipsis] = ...", i.e. arbitrary number of dimensions. Not in > first iteration though. Yeah, definitely. > Putting it in "parallel" is nice because prange already have out-of-order > semantics.... But of course, there are performance benefits even within a > single thread because of the out-of-order aspect. This should at least be a > big NOTE box in the documentation. > > >> >> Implementation >> =========== >> Frédéric, feel free to correct me at any point here :) >> >> As for the implementation, I think it will be a good idea to at least >> reuse (optionally through command line flags) Theano's optimization >> pipeline. I think it would be reasonably easy to build a Theano >> expression graph (after fusing the expressions in Cython first), run >> the Theano optimizations on that and map back to a Cython AST. >> Optionally, we could store a pickled graph representation (or even >> compiled theano function?), and provide it as an optional >> specialization at runtime (but mapping back correctly to memoryviews >> where needed, etc). As Numba matures, a numba runtime specialization >> could optionally be provided. > > > Can you enlighten us a bit about what Theano's optimizations involve? You > mention doing the iteration specializations yourself below, and also the > tiling.. > > Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and > numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided > by Theano? Yes, it does those kind of things, and it also eliminates common subexpressions, and it transforms certain expressions to BLAS/LAPACK functionality. I'm not sure we want that specifically. I'm thinking it might be more fruitful to start off with a theano-only specialization, and implement low-level code generation in Theano, and use that from Cython by either directly dumping in the code, or deferring that to Theano. At this point I'm not entirely sure. > If so, such optimizations should be done also for our scalar computations, > not just vector, right? > > Or does Theano deal with the memory access patterns? > I think it does so for the CUDA backend, but not for the C++ backend. I think we need to discuss this stuff on the theano mailing list. >> >> As for the implementation of the C specializations, I currently think >> we should implement our own, since theano heavily uses the numpy C >> API, and since its easier to have an optional theano runtime >> specialization anyway. I propose the following specializations, to be >> selected at runtime >> >> - vectorized contiguous, collapsed and aligned >> - this function can be called by a strided, inner dimension >> contiguous specialization >> - tiled (ndenumerate/nditerate) >> - tiled vectorized >> - plain C loops >> >> With 'aligned' it is not meant that the data itself should be aligned, >> but that they are aligned at the same (unaligned) offset. > > > Sounds good. > > About implementing this in Cython, would you simply turn > > a[...] = b + c > > into > > for i, j, k in parallel.nditerate(a, b, c): > a[i, j, k] = b[i, j, k] + c[i, j, k] > > and then focus on optimizing nditerate? That seemed like the logical path to > me, but perhaps that doesn't play nicely with using Theano to optimize the > expression? (Again I need a clearer picture of what this involves.) > > (Yes, AMD and I guess probably Intel too seem to have moved towards making > unaligned MOV as fast as aligned MOV I guess, so no need to worry about > that.) > > >> A runtime compiler could probably do much better here and allow for >> shuffling in the vectorized code for a minimal subset of the operands. >> Maybe it would be useful to provide a vectorized version where each >> operand is shuffled and the shuffle arguments are created up front? >> That might still be faster than non-vectorized... Anyway, the most >> important part would be tiling and proper memory access patterns. >> >> Which specialization is picked depends on a) which flags were passed >> to Cython, b) the runtime memory layout and c) what macros were >> defined when the Cython module was compiled. >> >> Criticism and constructive discussion welcome :) >> >> Cheers, >> >> Mark >> (heading out for lunch now) > > > Dag > _______________________________________________ > cython-devel mailing list > cython-devel@python.org > http://mail.python.org/mailman/listinfo/cython-devel _______________________________________________ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel