On 14 May 2012 21:36, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
> On 05/14/2012 06:31 PM, mark florisson wrote:
>>
>> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>
>>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>>
>>>> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>>>
>>>>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe in it; I think this is going to be *huge*. And if Mark F. likes it, it's not going to be without manpower; as his mentor I'd pitch in too here and there.)
>>>>>
>>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't want to micro-manage your GSoC, just have your take.)
>>>>>
>>>>> Travis, thank you very much for those good words in the "NA-mask interactions..." thread. It put most of my concerns away. If anybody is leaning towards opaqueness because of its OOP purity, I want to refer to C++ and its walled garden of ideological purity -- it has, what, 3-4 different OOP array libraries, none of which is able to out-compete the others. Meanwhile the rest of the world happily cooperates using pointers, strides, CSR and CSC.
>>>>>
>>>>> Now, there are limits to what you can do with strides and pointers. No one's denying the need for more. In my mind that's an API where you can do fetch_block and put_block of cache-sized, N-dimensional blocks on an array; but it might be something slightly different.
>>>>>
>>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C API to deal with this issue.
>>>>>
>>>>> What we need is duck-typing/polymorphism at the C level.
>>>>> If you keep extending ndarray and the NumPy C API, what we'll have is a one-to-many relationship: one provider of array technology, multiple consumers (with hooks, I'm sure, but all implementations of the hook concept in the NumPy world I've seen so far are a total disaster!).
>>>>>
>>>>> What I think we need instead is something like PEP 3118 for the "abstract" array that is only available block-wise with getters and setters. On the Cython list we've decided that what we want for CEP 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our own fields; we could create CEP 1001 to solve this issue and make any Python object an exporter of "block-getter/setter-arrays" (better name needed).
>>>>>
>>>>> What would be exported is (of course) a simple vtable:
>>>>>
>>>>> typedef struct {
>>>>>     int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right, ...);
>>>>>     ...
>>>>> } block_getter_setter_array_vtable;
>>>>>
>>>>> Let's please discuss the details *after* the fundamentals. But the reason I put void* there instead of PyObject* is that I hope this could be used beyond the Python world (say, Python<->Julia); the void* would be handed to you at the time you receive the vtable (however we handle that).
>>>>
>>>> I suppose it would also be useful to have some way of predicting the output format polymorphically for the caller. E.g. dense * block_diagonal results in block diagonal, but dense + block_diagonal results in dense, etc. It might be useful for the caller to know whether it needs to allocate a sparse, dense or block-structured array. Or maybe the polymorphic function could even do the allocation. This needs to happen recursively of course, to avoid intermediate temporaries.
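The proposed vtable can be illustrated with a duck-typed Python analogue of what a C-level exporter might provide. This is only a sketch of the get_block/put_block calling convention; the names BlockExporter, get_block and put_block are placeholders from the discussion above, not an agreed API, and a real implementation would sit behind a C vtable rather than a Python class:

```python
import numpy as np

class BlockExporter:
    """Toy exporter: serves N-dimensional blocks of a backing array.

    Illustrates the get_block/put_block convention only; a real
    exporter would hand out cache-sized blocks through a C vtable.
    """

    def __init__(self, data):
        self.data = np.asarray(data)

    def get_block(self, upper_left, lower_right):
        # Half-open box [upper_left, lower_right) in each dimension.
        index = tuple(slice(a, b) for a, b in zip(upper_left, lower_right))
        return self.data[index].copy()

    def put_block(self, upper_left, block):
        index = tuple(slice(a, a + n) for a, n in zip(upper_left, block.shape))
        self.data[index] = block

exporter = BlockExporter(np.arange(16).reshape(4, 4))
blk = exporter.get_block((0, 0), (2, 2))   # fetch the top-left 2x2 block
exporter.put_block((2, 2), blk + 100)      # write a block back elsewhere
```

The point of the convention is that the consumer never sees pointers or strides, only rectangular blocks addressed by corner coordinates, so the same consumer code works against any storage scheme the exporter hides.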
>>>> The compiler could easily handle that, and so could numpy when it gets lazy evaluation.
>>>
>>> Ah. But that also depends on the computation to be performed: a) elementwise, b) axis-wise reductions, c) linear algebra...
>>>
>>> In my oomatrix code (please don't look at it, it's shameful) I do this using multiple dispatch.
>>>
>>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..." -- I can't see how decisions here would trickle down to the API that's used in the kernel; it's more like a pre-phase, and better treated orthogonally.
>>>
>>>> I think if the heavy lifting of allocating output arrays and exporting these arrays works in numpy, then support in Cython could use that (I can already hear certain people object to more complicated array stuff in Cython :). Even better here would be an external project that each of our projects could use (I still think the nditer sorting functionality of arrays should be numpy-agnostic and externally available).
>>>
>>> I agree with the separate project idea. It's trivial for NumPy to incorporate that as one of its methods for exporting arrays, and I don't think it makes sense to either build it into Cython, or outright depend on NumPy.
>>>
>>> Here's what I'd like (working title: NumBridge?).
>>>
>>>  - Mission: Be the "double* + shape + strides" in a world where that is no longer enough, by providing tight, focused APIs/ABIs that are usable across C/Fortran/Python.
>>>
>>> I basically want something I can quickly acquire from a NumPy array, then pass it into my C code without dragging along all the cruft that I don't need.
>>>
>>>  - Written in pure C + specs, usable without Python
>>>
>>>  - PEP 3118 "done right": basically semi-standardize the internal Cython memoryview ABI and get something that's passable on the stack
>>>
>>>  - Block get/put API
>>>
>>>  - Iterator APIs
>>>
>>>  - Utility code for exporters and clients (iteration code, axis reordering, etc.)
>>>
>>> Is the scope of that insane, or is it at least worth a shot to see how bad it is? Beyond figuring out a small subset that can be done first, and whether performance considerations must be taken or not, there are two complicating factors: pluggable dtypes and memory management. Perhaps you could come to Oslo for a couple of days to brainstorm...
>>>
>>> Dag
>>
>> The blocks are a good idea, but I think fairly complicated for complicated matrix layouts. It would be nice to have something that is reasonably efficient for at least most of the array storage mechanisms. I'm going to do a little brain dump below; let's see if anything is useful :)
>>
>> What if we basically take the CSR format and make it a little simpler, easier to handle, and better suited for other layouts. Basically, keep shape/strides for dense arrays, and for blocked arrays just "extend" your number of dimensions, i.e.
>> a 2D blocked array becomes a 4D array, something like this:
>>
>> >>> a = np.arange(4).repeat(4).reshape(4, 4); a
>> array([[0, 0, 0, 0],
>>        [1, 1, 1, 1],
>>        [2, 2, 2, 2],
>>        [3, 3, 3, 3]])
>>
>> >>> a.shape = (2, 2, 2, 2)
>> >>> itemsize = a.dtype.itemsize
>> >>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>> >>> a
>> array([[[[0, 0],
>>          [1, 1]],
>>
>>         [[0, 0],
>>          [1, 1]]],
>>
>>
>>        [[[2, 2],
>>          [3, 3]],
>>
>>         [[2, 2],
>>          [3, 3]]]])
>>
>> >>> print a.flatten()
>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>>
>> Now, given some buffer flag (PyBUF_Sparse or something), use basically suboffsets with indirect dimensions, where only ever the last dimension is a row of contiguous memory (the entire thing may be contiguous, but the point is that you don't care). This row may
>>
>>  - be variable sized
>>  - have either a single "column start" (e.g. diagonal, band/tri-diagonal, block diagonal, etc.), OR
>>  - have a list of column indices, the length of the row (like in the CSR format)
>>
>> The length of each innermost row is then determined by looking at, in order:
>>  - the extent as specified in the shape list
>>  - if -1, and some flag is set, the extent is determined like CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>    -> maybe instead of pointers, indices are better, for serialization, GPUs, etc.
>>  - otherwise, use either a function pointer or perhaps a list of extents
>>
>> All these details will obviously be abstracted, allowing for easy iteration, but it can also be used by ufuncs operating on contiguous rows (often, since the actual storage is contiguous and stored in a 1D array, some flag could indicate contiguity as an optimization for unary ufuncs and flat iteration). Tiled nditer-ation could also work here without too big a hassle.
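The 2x2-blocked view above can also be built without assigning to a.strides (which newer NumPy versions discourage), using np.lib.stride_tricks.as_strided. A minimal sketch of the same layout:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(4).repeat(4).reshape(4, 4)
i = a.itemsize

# View the 4x4 array as a 2x2 grid of 2x2 blocks; index order is
# (block_row, block_col, row_in_block, col_in_block).
blocked = as_strided(a, shape=(2, 2, 2, 2),
                     strides=(8 * i, 2 * i, 4 * i, 1 * i))
print(blocked.flatten())   # elements in block-major order
```

Note as_strided creates a view into the same buffer, so flattening it reads the elements out in block-major order, which is exactly the "blocked arrays are just higher-dimensional strided arrays" trick being proposed.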
>> When you slice, you add to the suboffset and manipulate the single extent or entire list of extents in that dimension, and the flag to determine the length using the pointer subtraction should be cleared (this should probably all happen through vtable functions).
>>
>> An exporter would also be free to use different malloced pointers, allowing variable-sized array support with append/pop functionality in multiple dimensions (if there are no active buffer views).
>>
>> Random access in the case where a column start is provided is still constant time, and done through a[i][j][k - colstart], unless you have discontiguous rows, in which case you are allowed a logarithmic search (if the size exceeds some threshold). I see scipy.sparse does a linear search, which is pretty slow in pathological cases:
>>
>> from scipy import sparse
>> a = np.random.random((1, 4000000))
>> b = sparse.csr_matrix(a)
>> %timeit a[0, 1000]
>> %timeit b[0, 1000]
>>
>> 10000000 loops, best of 3: 190 ns per loop
>> 10 loops, best of 3: 29.3 ms per loop
>
> Heh. That is *extremely* pathological though; nobody does that in real code :-)
>
> Here's an idea I had yesterday: to get an ND sparse array with good spatial locality (for local operations etc.), you could map the elements to a volume-filling fractal and then use a hash map with linear probing. I bet it either doesn't work or has been done already :-)
>
>> Now, depending on the density and operation, the caller may have some idea of how to allocate an output array. I'm not sure how to handle "insertions" of new elements, but I presume through vtable put/insert functions. I'm also not sure how to fit this in with linear algebra functionality, other than providing conversions of the view.
>>
>> I think a disadvantage of this scheme is that you can't reorder your axes anymore, and many operations that are easy in the dense case are suddenly harder, e.g.
>> this scheme does not allow you to go from a CSR-like format into CSC. But I think what this gives is reasonable generality to allow easy use in C/Fortran/Cython compiled code/numpy ufunc invocation, as well as allowing efficient-ish storage for various kinds of arrays.
>>
>> Does this make any sense?
>
> I'll admit I didn't go through the finer details of your idea; let's deal with the below first and then I can re-read your post later.
>
> What I'm thinking is that this seems interesting, but perhaps it lowers the versatility so much that it's not really worth considering, for the GSoC at least. If the impact isn't high enough, my hunch is that one may as well not bother and just do strided arrays.
>
> I actually believe that the *likely* outcome of this discussion is to stick to the original plan and focus on expressions with strided arrays. But let's see.
>
> I'm not sure if what brought me to the blocked case was really block-sparse arrays or diagonal arrays. Let's see...:
>
> 1) Memory conservation. The array would NOT be element-sparse; it's just that you don't want to hold it all in memory at once.
>
> Examples:
>
>  - "a_array *= np.random.normal(size=a_array.shape)". The right-hand side can be generated on the fly (making it return the same data for each block every time is non-trivial but can be done). If a_array takes up a good chunk of RAM, there might not even be enough memory for the right-hand side except block-by-block. (That saves bandwidth too.)
>
>  - You want to stream from one file to another file, neither of which will fit in RAM. Here we really don't care about speed, just code reuse... it's irritating to have to manually block in such a situation.
>
>  - You want to play with a new fancy array format (a blocked format that's faster for some linear algebra, say). But then you need to call another C library that takes data+shape+stride+dtype.
> Then you need to make a copy, which you'd rather avoid -- so an alternative is to make that C library be based on the block API so that it can interface with your fancy new format (and anything else).
>
> 2) Bandwidth conservation. Numexpr, Theano, Numba and Cython vectorized expressions all already deal with this problem on one level.
>
> However, I also believe there's a higher level in many programs where blocks come into play. The organization of many scientific codes essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and it's going to be a *long* time before you have full-program analysis to block that up in every case; the tools above will be able to deal with some almost-trivial cases only.
>
> A block API could be used to write "pipeline programs". For instance, consider
>
>     a_array[:, None] * f(x_array)
>
> where f is some rather complicated routine in Fortran that is NOT a ufunc -- it takes all of x_array as input and provides all the output "at once"; but at some point it needs to do the writing of the output, and if writing to an API it could do the multiplication with a_array while the block is in cache anyway and save a memory bus round-trip. (Provided the output isn't needed as scratch space.)
>
> Summing up:
>
> Vectorized expressions on contiguous (or strided) memory in Cython are pretty nice by themselves; they would bring us up to the current state of the art of static compilation (in Fortran compilers).
>
> But my hunch is your sparse-block proposal doesn't add enough flexibility to be worth the pain. Many of the cases above can't be covered by it.
>
> If it's possible to identify a little nugget API that is forward-compatible with the above use cases (it wouldn't solve them perhaps, but would allow them to be solved with some extra supporting code), then it might be worth it to a) implement it in NumPy, b) support it for Cython vectorized expressions, and use that to support block transposition.
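The "a_array *= np.random.normal(...)" case can be sketched in plain NumPy to show what a block API would automate: the right-hand side is generated one row-block at a time, so only one block of random data ever exists in memory, and a per-block seed makes each block reproducible (the "same data for each block every time" property). The function name, block size and seeding scheme here are illustrative assumptions, not anything from the thread:

```python
import numpy as np

def blocked_inplace_normal_multiply(a, block_rows=1024, base_seed=0):
    """Multiply `a` in place by standard-normal noise, one row-block at a
    time, regenerating (never materializing) the full right-hand side."""
    for start in range(0, a.shape[0], block_rows):
        stop = min(start + block_rows, a.shape[0])
        # Seeding per block means re-running a block yields the same
        # noise, without ever storing the whole random array.
        rng = np.random.default_rng(base_seed + start)
        a[start:stop] *= rng.standard_normal(a[start:stop].shape)

a = np.ones((4096, 16))
blocked_inplace_normal_multiply(a)
```

With a block API, the loop, the block bounds and the scratch management would live behind the exporter, and this whole function would collapse to the one-line expression it implements.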
>
> But I'm starting to feel that we can't really know how that little nugget API should look until the above has been explored in a little more depth; and then we are easily talking about too large a scope without tying it to a larger project (which can't really happen before the GSoC...). For instance, 2) suggests push/pull rather than put/get.
>
> Dag
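On the scipy.sparse timing earlier in the thread: the slow b[0, 1000] lookup comes from scanning a row's column indices linearly, and the logarithmic search suggested above can be written directly against CSR's public arrays (indptr, indices, data, which are what scipy's csr_matrix exposes) using np.searchsorted. The helper name is made up for illustration:

```python
import numpy as np

def csr_lookup(indptr, indices, data, i, j):
    """O(log nnz_row) random access into CSR storage: binary search
    over the sorted column indices of row i."""
    start, stop = indptr[i], indptr[i + 1]
    k = start + np.searchsorted(indices[start:stop], j)
    if k < stop and indices[k] == j:
        return data[k]
    return 0.0  # structural zero

# CSR for [[10, 0, 20], [0, 30, 0]]: row i's entries live in
# indices/data[indptr[i]:indptr[i+1]], with column indices sorted.
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([10.0, 20.0, 30.0])
```

This assumes the per-row column indices are sorted, which CSR permits but does not require; the threshold-based fallback to linear scan mentioned above would cover small rows where binary search isn't worth it.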
I assumed as much; in any case I was going to start with dense arrays, simply because they are in common use and well-defined at this point.

Maybe what we really want is just lazy evaluation that works for different array layouts and storage mechanisms, and a smart JIT that can evaluate our linear algebra and array expressions in an optimal fashion (memory/cache-wise, but also out-of-core-wise). This probably means writing everything in a high-level language, because even if your Fortran routines themselves used your lazy evaluation library, your linear algebra library certainly wouldn't :)

Anyway, I'm going to be pragmatic and investigate how much of Theano we can reuse in Cython now.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion