Just to start the conversation, and to find out who is interested, I would like
to informally propose generator arrays for NumPy 2.0. This concept has, as
one use case, the deferred arrays that Mark Wiebe has proposed. But it also
allows for "compressed arrays", on-the-fly computed arrays, and streamed or
Basically, the modification I would like to make is to add an array flag
(MEMORY). When it is set, the data attribute of a numpy array is a pointer to
the address in memory where the data begins, and the strides attribute points
to a C-array of integers (in other words, all current arrays are MEMORY
arrays).
But when the MEMORY flag is not set, the data attribute instead points to a
length-2 C-array of pointers to functions:
[read(N, output_address, self->index_iter, self->extra),
 write(N, input_address, self->index_iter, self->extra)]
Either of these could then be NULL (i.e. if write is NULL, then the array must
be read-only).
When the MEMORY flag is not set, the strides member of the ndarray structure is
a pointer to the index_iter object (which could be anything that the particular
read and write methods need it to be).
The array structure should also get a member to hold the "extra" argument
(which would hold any state that the array needed to hold on to in order to
correctly perform the read or write operations --- i.e. it could hold an
execution graph for deferred evaluation).
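To make the layout concrete, here is a minimal Python model of the structure described above. It is only a sketch: the proposal concerns the C-level ndarray struct, and the names GeneratorArray, read_items, write_items and squares_read are hypothetical, as is the choice to raise an error when a callback is missing.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical Python model of the proposed layout. The real change would be
# made to the C ndarray struct; every name here is an illustrative assumption.

@dataclass
class GeneratorArray:
    memory: bool       # the proposed MEMORY flag
    data: Any          # a buffer if memory is set, else a (read, write) pair
    strides: Any       # strides if memory is set, else the index_iter object
    extra: Any = None  # state that read/write need (e.g. an execution graph)

def read_items(arr, n, out):
    """Fill out[0:n] with the first n items of a 1-D array."""
    if arr.memory:
        step = arr.strides              # 1-D case: a single item stride
        for i in range(n):
            out[i] = arr.data[i * step]
        return
    read, _write = arr.data
    if read is None:
        raise TypeError("array is not readable")
    read(n, out, arr.strides, arr.extra)

def write_items(arr, n, src):
    """Store src[0:n] into the first n items of a 1-D array."""
    if arr.memory:
        step = arr.strides
        for i in range(n):
            arr.data[i * step] = src[i]
        return
    _read, write = arr.data
    if write is None:                   # assumption: no write means read-only
        raise TypeError("array is read-only")
    write(n, src, arr.strides, arr.extra)

def squares_read(n, out, index_iter, extra):
    # Computes values on the fly; index_iter only carries a start offset here.
    for i in range(n):
        out[i] = (index_iter["start"] + i) ** 2

plain = GeneratorArray(memory=True, data=[10, 11, 12, 13], strides=2)
lazy = GeneratorArray(memory=False, data=(squares_read, None),
                      strides={"start": 3})

plain_out = [0, 0]
lazy_out = [0, 0]
read_items(plain, 2, plain_out)   # strided read from real memory
read_items(lazy, 2, lazy_out)     # values produced by the read callback
```

Callers go through the same read_items entry point in both cases, which is the main point of the flag: existing memory-backed arrays keep their fast path, and everything else is routed through the callbacks.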
The index_iter structure is anything that the read and write methods need to
correctly identify *where* to write. Now, clearly, we could combine
index_iter and extra into just one "structure" that holds all needed state for
read and write to work correctly. The reason I propose two slots is that,
at least mentally, in the use case where these structures are calculation
graphs, one of them is involved in "computing the location to read/write"
and the other is involved in "computing what to read/write".
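The split between the two slots can be illustrated with a small Python sketch. Here index_iter is an iterator over positions (the "where") and extra is a tiny deferred expression graph (the "what"). The tuple-based graph format and the helper eval_at are assumptions made up for this example, not part of the proposal.

```python
import operator

# Hypothetical deferred expression node: (op, left, right); leaves are lists.
def eval_at(node, i):
    """Evaluate the expression graph for the single element at position i."""
    if isinstance(node, tuple):
        op, left, right = node
        return op(eval_at(left, i), eval_at(right, i))
    return node[i]

def read(n, out, index_iter, extra):
    # index_iter decides *where* to read; extra decides *what* is computed.
    for k in range(n):
        out[k] = eval_at(extra, next(index_iter))

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
graph = (operator.add, a, (operator.mul, b, b))  # a + b*b, not yet evaluated
where = iter([3, 0, 2])                          # fancy-index style selection

out = [None] * 3
read(3, out, where, graph)
```

Only the three selected elements are ever computed, so the same pair of slots covers both deferred evaluation and fancy indexing as a view.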
The idea is fairly simple, but with some very interesting potential features:
* lazy evaluation (of indexing, ufuncs, etc.)
* fancy indexing as views instead of copies (really just another
example of lazy evaluation)
* compressed arrays
* generated arrays (from computation or streamed data)
* infinite arrays
* computed arrays
* missing-data arrays
* ragged arrays (shape would be the bounding box --- which makes me
think of ragged arrays as examples of masked arrays).
* arrays that view PIL data.
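As one illustration of the list above, a compressed array could keep zlib-compressed chunks in the extra slot and decompress them only when read is called. This is a rough Python sketch; the chunk size, the float64 item format, and the use of a plain integer offset as index_iter are all assumptions.

```python
import struct
import zlib

ITEM = struct.Struct("<d")  # assumption: little-endian float64 items
CHUNK = 4                   # items per compressed chunk (arbitrary choice)

def compress(values):
    """Split values into chunks and compress each chunk separately."""
    chunks = []
    for i in range(0, len(values), CHUNK):
        raw = b"".join(ITEM.pack(v) for v in values[i:i + CHUNK])
        chunks.append(zlib.compress(raw))
    return chunks

def read(n, out, index_iter, extra):
    # index_iter: starting item offset; extra: list of compressed chunks.
    cache = {}  # decompress each chunk at most once per call
    for k in range(n):
        c, off = divmod(index_iter + k, CHUNK)
        if c not in cache:
            cache[c] = zlib.decompress(extra[c])
        out[k] = ITEM.unpack_from(cache[c], off * ITEM.size)[0]

chunks = compress([float(i) for i in range(10)])
out = [0.0] * 3
read(3, out, 3, chunks)  # items 3, 4, 5 span two chunks
```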
One could build an array with a (logically) infinite number of elements (we
could use -2 in the shape tuple to indicate that).
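A logically infinite array falls out naturally once values come from a read function instead of memory. The sketch below uses the sentinel value mentioned above in the shape tuple; the name INFINITE and the choice of itertools.count as the index_iter are assumptions for illustration.

```python
import itertools

INFINITE = -2  # sentinel for an unbounded dimension, as suggested in the text

def read(n, out, index_iter, extra):
    # index_iter: iterator over positions; extra: maps a position to a value.
    for k in range(n):
        out[k] = extra(next(index_iter))

shape = (INFINITE,)
positions = itertools.count(0)  # never exhausted

def odd(i):
    return 2 * i + 1

first = [0] * 5
read(5, first, positions, odd)   # first five odd numbers
second = [0] * 5
read(5, second, positions, odd)  # continues where the last read stopped
```

Nothing is ever materialized beyond the requested items, so the "size" of the array is purely a property of the shape tuple.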
We don't need examples of all of these features for NumPy 2.0 to be released,
because to really make this useful, we would need to modify all "calculation"
code to produce a NON MEMORY array. What to do here still needs a lot of
thought and experimentation.
But I can imagine a situation where all NumPy calculations that produce
arrays offer the option that, when they are done inside a particular
context, a user-supplied behavior overrides the default return. I want to
study what Mark is proposing and understand his new iterator at a deeper level
before providing more thoughts here.
That's the gist of what I am thinking about. I would love feedback and
The other things I would like to see in NumPy 2.0 that have not been discussed
lately (that could affect the ABI) are:
* a geometry member in the data structure (that allows labels for
dimensions and axes to be provided -- a la data_array)
* small array performance improvements that Mark Wiebe has suggested
(including the addition of an optional low-level loop that is used when you
have contiguous data)
* completed datetime implementation
* pointer data-types (i.e. the memory location holds a pointer to
another part of an ndarray) --- very useful for "join" - type arrays
If anybody is interested in helping with any of these (and has time to do it),
let me know. Some of this I could fund (especially if you are willing to
come to Austin and be an intern for Enthought).
P.S. I hope to have more time this year to hang out here on the
numpy-discussion list (but we will see....)