Just to start the conversation, and to find out who is interested, I would like 
to informally propose generator arrays for NumPy 2.0.  This concept has, as one 
use case, the deferred arrays that Mark Wiebe has proposed.  But it also allows 
for "compressed arrays", on-the-fly computed arrays, and streamed or generated 
arrays.

Basically, the modification I would like to make is to have an array flag 
(MEMORY) that, when set, means that the data attribute of a NumPy array is a 
pointer to the address in memory where the data begins, and the strides 
attribute points to a C-array of integers (in other words, all current arrays 
are MEMORY arrays).

But when the MEMORY flag is not set, the data attribute instead points to a 
length-2 C-array of pointers to functions:

        [read(N, output_address, self->index_iter, self->extra),
         write(N, input_address, self->index_iter, self->extra)]

Either of these could then be NULL (i.e. if write is NULL, then the array must 
be read-only). 

When the MEMORY flag is not set, the strides member of the ndarray structure is 
a pointer to the index_iter object (which could be anything that the particular 
read and write methods need it to be).  

The array structure should also get a member to hold the "extra" argument 
(which would hold any state the array needs to hold on to in order to 
correctly perform the read or write operations --- for example, it could hold 
an execution graph for deferred evaluation).

The index_iter structure is anything that the read and write methods need to 
correctly identify *where* to read or write.  Now, clearly, we could combine 
index_iter and extra into just one "structure" that holds all the state needed 
for read and write to work correctly.  The reason I propose two slots is that, 
at least mentally, in the use case where these structures are calculation 
graphs, one of them is involved in "computing the location to read/write" and 
the other in "computing what to read/write".

The idea is fairly simple, but with some very interesting potential features: 

        * lazy evaluation (of indexing, ufuncs, etc.)
        * fancy indexing as views instead of copies (really just another 
example of lazy evaluation)
        * compressed arrays 
        * generated arrays (from computation or streamed data)
        * infinite arrays
        * computed arrays
        * missing-data arrays
        * ragged arrays (shape would be the bounding box --- which makes me 
think of ragged arrays as examples of masked arrays). 
        * arrays that view PIL data.

One could build an array with a (logically) infinite number of elements (we 
could use -2 in the shape tuple to indicate that). 
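
As a toy illustration of the "computed array" and "infinite array" items above, 
here is what a read callback could look like, in the same hypothetical style as 
the earlier sketch (the choice of a flat index for index_iter and the 
squares_read name are made up for the example):

    #include <stddef.h>

    /* A computed array of doubles whose i-th element is i*i; the values
     * are never stored anywhere.  index_iter is simply a pointer to the
     * current flat index, and extra is unused. */
    static int squares_read(size_t N, void *output_address,
                            void *index_iter, void *extra)
    {
        double *out = (double *)output_address;
        size_t *pos = (size_t *)index_iter;   /* current flat index */
        (void)extra;                          /* no extra state needed */

        for (size_t i = 0; i < N; i++) {
            double idx = (double)(*pos + i);
            out[i] = idx * idx;               /* computed on demand */
        }
        *pos += N;
        return 0;
    }

With write left NULL such an array would be read-only, and since no buffer ever 
backs it, its shape could just as well use the -2 "infinite" sentinel mentioned 
above.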

We don't need examples of all of these features for NumPy 2.0 to be released, 
because to really make this useful we would need to modify all the 
"calculation" code to produce a non-MEMORY array.  What to do here still needs 
a lot of thought and experimentation.

But I can imagine a situation where all NumPy calculations that produce arrays 
provide the option that, when they are performed inside a particular context, a 
user-supplied behavior overrides the default return.  I want to study what Mark 
is proposing and understand his new iterator at a deeper level before providing 
more thoughts here.

That's the gist of what I am thinking about.   I would love feedback and 
comments. 

The other things I would like to see in NumPy 2.0 that have not been discussed 
lately (that could affect the ABI) are: 

        * a geometry member to the data structure (that allows labels for 
dimensions and axes to be provided -- a la data_array)
        * small array performance improvements that Mark Wiebe has suggested 
(including the addition of an optional low-level loop that is used when you 
have contiguous data)
        * completed datetime implementation
        * pointer data-types (i.e. the memory location holds a pointer to 
another part of an ndarray) --- very useful for "join"-type arrays (a rough 
sketch follows this list)
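
To show what that last item is getting at, here is a rough plain-C picture (no 
actual NumPy dtype machinery is involved; the names are made up): one array's 
elements are simply addresses of elements in another array, so a "join" can 
reference values without copying them.

    #include <stdio.h>

    int main(void)
    {
        double table[5] = {10.0, 20.0, 30.0, 40.0, 50.0};

        /* A "joined" view: each element is a pointer back into table. */
        double *join[3] = {&table[4], &table[1], &table[4]};

        for (int i = 0; i < 3; i++)
            printf("join[%d] -> %g\n", i, *join[i]);

        return 0;
    }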

If anybody is interested in helping with any of these (and has time to do it), 
let me know.  Some of this I could fund (especially if you are willing to come 
to Austin and be an intern for Enthought).

Best regards,

-Travis


P.S.  I hope to have more time this year to hang out here on the 
numpy-discussion list (but we will see....)
