On 21 May 2012 12:08, mark florisson <markflorisso...@gmail.com> wrote: > On 21 May 2012 11:34, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote: >> On 05/20/2012 04:03 PM, mark florisson wrote: >>> >>> Hey, >>> >>> For my gsoc we already have some simple initial ideas, i.e. >>> elementwise vector expressions (a + b with a and b arrays with >>> arbitrary rank), I don't think these need any discussion. However, >>> there are a lot of things that haven't been formally discussed on the >>> mailing list, so here goes. >>> >>> Frédéric, I am CCing you since you expressed interest on the numpy >>> mailing list, and I think your insights as a Theano developer can be >>> very helpful in this discussion. >>> >>> User Interface >>> =========== >>> Besides simple array expressions for dense arrays I would like a >>> mechanism for "custom ufuncs", although to a different extent to what >>> Numpy or Numba provide. There are several ways in which we could want >>> them, e.g. as typed functions (cdef, or external C) functions, as >>> lambas or Python functions in the same module, or as general objects >>> (e.g. functions Cython doesn't know about). >>> To achieve maximum efficiency it will likely be good to allow sharing >>> these functions in .pxd files. We have 'cdef inline' functions, but I >>> would prefer annotated def functions where the parameters are >>> specialized on demand, e.g. >>> >>> @elemental >>> def add(a, b): # elemental functions can have any number of arguments >>> and operate on any compatible dtype >>> return a + b >>> >>> When calling cdef functions or elemental functions with memoryview >>> arguments, the arguments perform a (broadcasted) elementwise >>> operation. Alternatively, we can have a parallel.elementwise function >>> which maps the function elementwise, which would also work for object >>> callables. I prefer the former, since I think it will read much >>> easier. >>> >>> Secondly, we can have a reduce function (and maybe a scan function), >>> that reduce (respectively scan) in a specified axis or number of axes. >>> E.g. >>> >>> parallel.reduce(add, a, b, axis=(0, 2)) >>> >>> where the default for axis is "all axes". As for the default value, >>> this could be perhaps optionally provided to the elemental decorator. >>> Otherwise, the reducer will have to get the default values from each >>> dimension that is reduced in, and then skip those values when >>> reducing. (Of course, the reducer function must be associate and >>> commutative). Also, a lambda could be passed in instead of an >> >> >> Only associative, right? >> >> Sounds good to me. >> >> >>> elementwise or typed cdef function. >>> >>> Finally, we would have a parallel.nditer/ndenumerate/nditerate >>> function, which would iterate over N memoryviews, and provide a >>> sensible memory access pattern (like numpy.nditer). I'm not sure if it >>> should provide only the indices, or also the values. e.g. an inplace >>> elementwise add would read as follows: >>> >>> for i, j, k in parallel.nditerate(A, B): >>> A[i, j, k] += B[i, j, k] >> >> >> >> I think this sounds good; I guess don't see a particular reason for >> "ndenumerate", I think code like the above is clearer. >> >> It's perhaps worth at least thinking about how to support "for idx in ...", >> "A[idx[2], Ellipsis] = ...", i.e. arbitrary number of dimensions. Not in >> first iteration though. >> >> Putting it in "parallel" is nice because prange already have out-of-order >> semantics.... But of course, there are performance benefits even within a >> single thread because of the out-of-order aspect. This should at least be a >> big NOTE box in the documentation. >> >> >>> >>> Implementation >>> =========== >>> Frédéric, feel free to correct me at any point here :) >>> >>> As for the implementation, I think it will be a good idea to at least >>> reuse (optionally through command line flags) Theano's optimization >>> pipeline. I think it would be reasonably easy to build a Theano >>> expression graph (after fusing the expressions in Cython first), run >>> the Theano optimizations on that and map back to a Cython AST. >>> Optionally, we could store a pickled graph representation (or even >>> compiled theano function?), and provide it as an optional >>> specialization at runtime (but mapping back correctly to memoryviews >>> where needed, etc). As Numba matures, a numba runtime specialization >>> could optionally be provided. >> >> >> Can you enlighten us a bit about what Theano's optimizations involve? You >> mention doing the iteration specializations yourself below, and also the >> tiling.. >> >> Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and >> numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided >> by Theano? >> >> If so, such optimizations should be done also for our scalar computations, >> not just vector, right? > > As for this, I don't think CSE is important for scalar computations, > since if they are objects, you have to go through a Python layer since > you have no idea what the code does, and if they are C objects, the C > compiler will readily optimize that out. You want this for vector > expressions since the computations may be expensive and the C > compiler may not optimize them. E.g. consider the silly example of v1 > * sum(A) + v2 * sum(A). It's more convenient to write than to > introduce a new variable manually. > >> Or does Theano deal with the memory access patterns? >> >> >>> >>> As for the implementation of the C specializations, I currently think >>> we should implement our own, since theano heavily uses the numpy C >>> API, and since its easier to have an optional theano runtime >>> specialization anyway. I propose the following specializations, to be >>> selected at runtime >>> >>> - vectorized contiguous, collapsed and aligned >>> - this function can be called by a strided, inner dimension >>> contiguous specialization >>> - tiled (ndenumerate/nditerate) >>> - tiled vectorized >>> - plain C loops >>> >>> With 'aligned' it is not meant that the data itself should be aligned, >>> but that they are aligned at the same (unaligned) offset. >> >> >> Sounds good. >> >> About implementing this in Cython, would you simply turn >> >> a[...] = b + c >> >> into >> >> for i, j, k in parallel.nditerate(a, b, c): >> a[i, j, k] = b[i, j, k] + c[i, j, k] >> >> and then focus on optimizing nditerate? That seemed like the logical path to >> me, but perhaps that doesn't play nicely with using Theano to optimize the >> expression? (Again I need a clearer picture of what this involves.) > > I don't think so, since I think we want to try explicit vectorization. > I think the iteration and tiling mechanism etc will be shared by both, > but we wouldn't provide that direct mapping (I think). Although maybe > we could insert a VectorAssignment with VectorScalars... let's see, in > any case both will be optimized in the same way, except that with > vector expressions you know you're always getting the best the > compiler can do.
(Where of course the nditerate loop striding pattern would be patched accordingly.) >> (Yes, AMD and I guess probably Intel too seem to have moved towards making >> unaligned MOV as fast as aligned MOV I guess, so no need to worry about >> that.) >> >> >>> A runtime compiler could probably do much better here and allow for >>> shuffling in the vectorized code for a minimal subset of the operands. >>> Maybe it would be useful to provide a vectorized version where each >>> operand is shuffled and the shuffle arguments are created up front? >>> That might still be faster than non-vectorized... Anyway, the most >>> important part would be tiling and proper memory access patterns. >>> >>> Which specialization is picked depends on a) which flags were passed >>> to Cython, b) the runtime memory layout and c) what macros were >>> defined when the Cython module was compiled. >>> >>> Criticism and constructive discussion welcome :) >>> >>> Cheers, >>> >>> Mark >>> (heading out for lunch now) >> >> >> Dag >> _______________________________________________ >> cython-devel mailing list >> cython-devel@python.org >> http://mail.python.org/mailman/listinfo/cython-devel _______________________________________________ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel