So my superficial thoughts:

1. The buffer protocol has two bits. The first says "given this predictable memory layout, you can look up an item in memory with these rules"; the second describes what the items in memory are. I think you're only proposing to change the second part of it. I'd encourage you not to change the first part - the nice thing about the first part is that it's relatively simple and doesn't try to do too much. For example, I'd be sceptical about trying to support ragged arrays.

2. As you identify, for a more advanced memoryview to be useful in Cython, Cython really has to be able to know an underlying C type for your data at compile-time and be able to validate at runtime that the buffer it's passed matches that C type. The validation could have varying degrees of strictness (i.e. in the worst case we could just check the size matches and trust the user). We already support that to some extent (packed structs with structured arrays) but that doesn't cover everything.

3. For your variable length string example, the C struct to use is fairly obvious (just your `struct ss`). The difficult bit is likely to be memory management of that. I'd kind of encourage you not to expect Cython to handle the memory management for this type of thing (i.e. it can expose the struct to the user, but it becomes the user's own problem to work out if they need to allocate memory when they modify the struct).

5. Things like datetimes for Pandas, or a way of having a float16 type, seem like the sort of thing we should definitely be able to do.

6. In terms of Apache Arrow - if there was demand we probably could add support for it. Their documentation says: "The Arrow C data interface defines a very small, stable set of C definitions that can be easily /copied/ in any project’s source code" - so that suggests it need not be a dependency.

7. One of the points of the "typed memoryview" vs the older "np.ndarray" interface is that it was supposed to be more generally compatible. While we could extend it to match any non-standard additions that NumPy tries to make, that does feel dodgy and likely to conflict when other projects do their own thing. I think it would be better if the Python standard could be extended (even if it was just something like a code to indicate "mystery structure of size X").
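
As a rough sketch of points 1 and 2 above, here is the split in plain Python rather than Cython. The helper names (`item_at`, `loosely_matches`) and the `Pair` struct are made up for illustration; this is not how Cython implements anything, just the idea of "layout rules" versus "item description", plus the weakest form of validation (size only).

```python
import ctypes
import struct

# Part one of the protocol: layout. A 2x3 array of C ints in one block.
raw = struct.pack("6i", 10, 11, 12, 20, 21, 22)
view = memoryview(raw).cast("i", shape=[2, 3])


def item_at(buf, raw_bytes, index):
    """Locate an item from the strides alone, then decode it via the format.

    Hypothetical helper: the offset calculation uses only the layout
    information, and only the final unpack needs to know what the item is.
    """
    offset = sum(i * s for i, s in zip(index, buf.strides))
    return struct.unpack_from(buf.format, raw_bytes, offset)[0]


# Part two: what the items are. Pair is a made-up item type standing in
# for "the C type known at compile time".
class Pair(ctypes.Structure):
    _fields_ = [("a", ctypes.c_int32), ("b", ctypes.c_double)]


def loosely_matches(buf, ctype):
    """Weakest possible check: compare item sizes and trust the user."""
    return buf.itemsize == ctypes.sizeof(ctype)


pairs = memoryview((Pair * 4)())

print(item_at(view, raw, (1, 2)), view.tolist()[1][2])
print(loosely_matches(pairs, Pair), loosely_matches(view, Pair))
```

A stricter check would also compare the format string field by field, which is roughly what already happens for packed structs with structured arrays.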

Don't know if these thoughts are useful. They're a bit scattered. I guess the summary is "we could definitely do more with custom data types, but don't break the things that made the buffer protocol nice".

David


On 06/07/2023 17:43, Nathan wrote:
Hi all,

I'm working on a new data type for NumPy to represent arrays of variable-width strings [1]. One limitation of using this data type right now is that it's not possible to write idiomatic Cython code operating on the array; instead one would need to use e.g. the NumPy iterator API. It turns out this is a papercut that's been around for a while and is most noticeable downstream because datetime arrays cannot be passed to Cython.

Here's an example of a downstream library working around lack of support in Cython for datetimes using an iterator: [2]. Pandas works around this by passing int64 views of the arrays to Cython. I think this issue will become more problematic in the future when NumPy officially ships the NEP 42 custom dtype API, which will make it much easier to develop custom data types. It is also already an issue for the legacy custom data types numpy already supports [3], but those aren't very popular so it hasn't come up much.

I'm curious if there's any appetite among the Cython developers to ultimately make it easier to write cython code that works with numpy arrays that have user-defined data types. Currently it's only possible to write code using the numpy or typed memoryview interfaces for arrays with datatypes that support the buffer protocol. See e.g. https://github.com/numpy/numpy/issues/4983.

One approach to fix this would be to either officially or unofficially extend the buffer protocol to allow arbitrary typecodes to be sent in the format string. Officially Python only supports format codes used in the struct module, but in practice you can put any string in the format code and memoryview will accept it. Of course, for this to actually be useful, NumPy would need to create format codes that allow Cython to correctly read and reconstruct the type.
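
A small illustration of that last point, using ctypes only because it is a convenient standard-library exporter. The `Point` struct is hypothetical. ctypes exports a format string that is not valid struct-module syntax; memoryview still wraps the buffer but cannot decode the items itself:

```python
import ctypes
import struct


class Point(ctypes.Structure):  # hypothetical item type
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


mv = memoryview((Point * 3)())
fmt = mv.format  # a "T{...}" description, not struct-module syntax

# The struct module rejects the format string...
try:
    struct.calcsize(fmt)
    struct_understands = True
except struct.error:
    struct_understands = False

# ...and memoryview accepts the buffer but cannot index into it.
try:
    mv[0]
    readable = True
except NotImplementedError:
    readable = False

print(fmt, mv.itemsize, struct_understands, readable)
```

So the format string is carried through untouched, and it is up to the consumer (here that would be Cython) to know how to interpret it.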

Sebastian Berg proposed this on the CPython discussion forum [4] and there hasn't been much response from upstream. In response to Sebastian, Michael Droettboom suggested [5] using the Arrow data format, which has rich support for various array memory layouts and has support for exchanging custom extension types [6].

The main problem with the buffer protocol approach is defining the protocol in such a way that Cython can appropriately reconstruct the memory layout for an arbitrary user-defined data type, ideally without needing to import any code defining the data type (although only supporting strided arrays at first makes a lot of sense).

The main problem with the approach using Apache Arrow is that neither Cython nor NumPy has any support for it, and I don't think either library can depend on Arrow, so both would need to write custom serializers and parsers, whereas Cython already has memoryviews fully working.

Guido van Rossum wanted some more discussion about this, so I'm raising this as an issue here in case any Cython developers are interested. Please chime in on the Python Discourse thread if so.

-Nathan

[1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
[2] https://github.com/scikit-hep/awkward/issues/367
[3] https://github.com/numpy/numpy/issues/18442
[4] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
[6] https://arrow.apache.org/docs/format/Columnar.html

_______________________________________________
cython-devel mailing list
cython-devel@python.org
https://mail.python.org/mailman/listinfo/cython-devel
