So my superficial thoughts:
1. The buffer protocol has two bits. The first says "given this
predictable memory layout, you can look up an item in memory with these
rules"; the second describes what the items in memory are. I think
you're only proposing to change the second part of it. I'd encourage you
not to change the first part - the nice thing about the first part is
that it's relatively simple and doesn't try to do anything. For example
I'd be sceptical about trying to support ragged arrays.
2. As you identify, for a more advanced memoryview to be useful in
Cython, Cython really has to be able to know an underlying C type for
your data at compile-time and be able to validate that the buffer it's
passed matches that C type at runtime. The validation could have varying
degrees of strictness (i.e. in the worst case we could just check the
size matches and trust the user). We already support that to some extent
(packed structs with structured arrays), but that doesn't cover everything.
3. For your variable length string example, the C struct to use is
fairly obvious (just your `struct ss`). The difficult bit is likely to
be memory management of that. I'd kind of encourage you not to expect
Cython to handle the memory management for this type of thing (i.e. it
can expose the struct to the user, but it becomes the user's own problem
to work out if they need to allocate memory when they modify the struct).
5. Things like the datetime for Pandas, or a way of having a float16
type, seem like the sort of thing we should definitely be able to do.
6. In terms of Apache Arrow - if there was demand we probably could add
support for it. Their documentation says: "The Arrow C data interface
defines a very small, stable set of C definitions that can be easily
/copied/ in any project’s source code" - so that suggests it need not be
a dependency.
7. One of the points of the "typed memoryview" vs the older "np.ndarray"
interface is that it was supposed to be more generally compatible.
While we could extend it to match any non-standard additions that Numpy
tries to make, that does feel dodgy and likely to conflict when other
projects do their own thing. I think it would be better if the Python
standard could be extended (even if it was just something like a code to
indicate "mystery structure of size X")
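To make the two halves of the buffer protocol from point 1 concrete, here is a minimal sketch in plain Python (standard library only). The helper name `item_offset` and the 2x3 block of doubles are made up for illustration; this is not how Cython implements anything internally. The strides supply the lookup rule and the format string says what sits at the resulting address.

```python
import struct
from array import array


def item_offset(index, strides):
    """Byte offset of an N-dimensional index under the strided-layout rule."""
    return sum(i * s for i, s in zip(index, strides))


# Six doubles exposed through the buffer protocol, then viewed as a
# C-contiguous 2x3 block. `raw` is the same memory seen as plain bytes.
raw = memoryview(array("d", [0.0, 1.0, 2.0, 10.0, 11.0, 12.0])).cast("B")
view = raw.cast("d", shape=[2, 3])

# Part one: where the item lives (depends only on the strides).
offset = item_offset((1, 2), view.strides)

# Part two: what the item is (depends only on the format string).
(value,) = struct.unpack_from(view.format, raw, offset)

print(view.shape, view.strides, view.format, offset, value)
```

Changing the item type only touches the second step; the offset calculation stays the same, which is the part that seems worth keeping simple.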
Don't know if these thoughts are useful. They're a bit scattered. I
guess the summary is "we could definitely do more with custom data
types, but don't break the things that made the buffer protocol nice".
David
On 06/07/2023 17:43, Nathan wrote:
Hi all,
I'm working on a new data type for numpy to represent arrays of
variable-width strings [1]. One limitation of using this data type
right now is that it's not possible to write idiomatic Cython code
operating on the array; instead one would need to use e.g. the NumPy
iterator API. It turns out this is a papercut that's been around for a
while and is most noticeable downstream because datetime arrays cannot
be passed to Cython.
Here's an example of a downstream library working around lack of
support in Cython for datetimes using an iterator: [2]. Pandas works
around this by passing int64 views of the arrays to Cython. I think
this issue will become more problematic in the future when NumPy
officially ships the NEP 42 custom dtype API, which will make it much
easier to develop custom data types. It is also already an issue for
the legacy custom data types numpy already supports [3], but those
aren't very popular so it hasn't come up much.
I'm curious if there's any appetite among the Cython developers to
ultimately make it easier to write cython code that works with numpy
arrays that have user-defined data types. Currently it's only possible
to write code using the numpy or typed memoryview interfaces for
arrays with datatypes that support the buffer protocol. See e.g.
https://github.com/numpy/numpy/issues/4983.
One approach to fix this would be to either officially or unofficially
extend the buffer protocol to allow arbitrary typecodes to be sent in
the format string. Officially Python only supports format codes used
in the struct module, but in practice you can put any string in the
format code and memoryview will accept it. Of course for it actually
to be useful NumPy would need to create format codes that allow cython
to correctly read and reconstruct the type.
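As a rough illustration of the gap between the struct module and what memoryview accepts, the sketch below uses ctypes from the standard library, which already exports PEP 3118 style format strings that the struct module cannot parse. The `Item` struct and both helper functions are hypothetical and only stand in for a user-defined data type; this is not a proposed NumPy or Cython format.

```python
import ctypes
import struct


class Item(ctypes.Structure):
    # Hypothetical item layout, standing in for a user-defined data type.
    _fields_ = [("size", ctypes.c_size_t), ("buf", ctypes.c_void_p)]


# ctypes arrays export the buffer protocol, so memoryview can wrap them.
exported = memoryview((Item * 4)())


def struct_module_understands(fmt):
    """Return True if the struct module can parse this format string."""
    try:
        struct.calcsize(fmt)
        return True
    except struct.error:
        return False


def size_only_check(view, ctype):
    """Loosest possible validation: compare item sizes and nothing else."""
    return view.itemsize == ctypes.sizeof(ctype)


print(exported.format)                             # platform-dependent string
print(struct_module_understands(exported.format))  # struct cannot parse it
print(size_only_check(exported, Item))             # item size is still usable
```

The memoryview is created without complaint even though the format is opaque to the struct module; what a consumer can still rely on is the item size, the shape and the strides.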
Sebastian Berg proposed this on the CPython discussion forum [4] and
there hasn't been much response from upstream. In response to
Sebastian, Michael Droettboom suggested [5] using the Arrow data
format, which has rich support for various array memory layouts and
has support for exchanging custom extension types [6].
The main problem with the buffer protocol approach is defining the
protocol in such a way that Cython can appropriately reconstruct the
memory layout for the data type (although only supporting strided
arrays at first makes a lot of sense) for an arbitrary user-defined
data type, ideally without needing to import any code defining the
data type.
The main problem with the approach using Apache Arrow is that neither
Cython nor NumPy has any support for it and I don't think either
library can depend on Arrow so both would need to write custom
serializers and parsers, whereas Cython already has memoryviews fully
working.
Guido van Rossum wanted some more discussion about this, so I'm
raising this as an issue here in case any Cython developers are
interested. Please chime in on the Python Discourse thread if so.
-Nathan
[1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
[2] https://github.com/scikit-hep/awkward/issues/367
[3] https://github.com/numpy/numpy/issues/18442
[4]
https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[5]
https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
[6] https://arrow.apache.org/docs/format/Columnar.html
_______________________________________________
cython-devel mailing list
cython-devel@python.org
https://mail.python.org/mailman/listinfo/cython-devel