Hi all,

I'm working on a new data type for NumPy to represent arrays of variable-width strings [1]. One limitation of using this data type right now is that it's not possible to write idiomatic Cython code operating on the array; instead one would need to use e.g. the NumPy iterator API. It turns out this is a papercut that's been around for a while, and it is most noticeable downstream because datetime arrays cannot be passed to Cython.
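For concreteness, here is a minimal sketch of the papercut (plain Python rather than Cython, but Cython typed memoryviews acquire buffers through the same protocol); the exact exception and message may vary between NumPy versions:

    import numpy as np

    ints = np.arange(3, dtype="int64")
    # int64 maps onto a struct-module format code ('l' or 'q'), so the
    # buffer export succeeds and a Cython typed memoryview works fine.
    print(memoryview(ints).format)

    dates = np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")
    try:
        # This is essentially what acquiring a typed memoryview does.
        memoryview(dates)
    except (ValueError, BufferError) as exc:
        # NumPy refuses to export datetime64 through the buffer protocol,
        # e.g. "cannot include dtype 'M' in a buffer".
        print(exc)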
Here's an example of a downstream library working around the lack of Cython support for datetimes by using an iterator: [2]. Pandas works around it by passing int64 views of the arrays to Cython (see the sketch in the postscript below). I think this issue will become more problematic in the future when NumPy officially ships the NEP 42 custom dtype API, which will make it much easier to develop custom data types. It is also already an issue for the legacy custom data types NumPy supports [3], but those aren't very popular, so it hasn't come up much.

I'm curious whether there's any appetite among the Cython developers to ultimately make it easier to write Cython code that works with NumPy arrays that have user-defined data types. Currently it's only possible to write code using the numpy or typed memoryview interfaces for arrays whose data types support the buffer protocol; see e.g. https://github.com/numpy/numpy/issues/4983.

One approach would be to extend the buffer protocol, either officially or unofficially, to allow arbitrary typecodes in the format string. Officially, Python only supports the format codes used in the struct module, but in practice you can put any string in the format field and memoryview will accept it. Of course, for this to actually be useful, NumPy would need to define format codes that allow Cython to correctly read and reconstruct the type. Sebastian Berg proposed this on the CPython Discourse [4], and there hasn't been much response from upstream. In response to Sebastian, Michael Droettboom suggested [5] using the Apache Arrow columnar format [6], which has rich support for various array memory layouts and for exchanging custom extension types.

The main problem with the buffer protocol approach is defining the protocol in such a way that Cython can correctly reconstruct the memory layout of an arbitrary user-defined data type (although only supporting strided arrays at first makes a lot of sense), ideally without needing to import any code defining the data type. The main problem with the Apache Arrow approach is that neither Cython nor NumPy has any support for it, and I don't think either library can depend on Arrow, so both would need to write custom serializers and parsers, whereas Cython already has memoryviews fully working.

Guido van Rossum wanted some more discussion about this, so I'm raising it here in case any Cython developers are interested. Please chime in on the Python Discourse thread if so.

-Nathan

[1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
[2] https://github.com/scikit-hep/awkward/issues/367
[3] https://github.com/numpy/numpy/issues/18442
[4] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
[6] https://arrow.apache.org/docs/format/Columnar.html
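P.S. To make the pandas-style workaround mentioned above concrete, here is a minimal plain-Python sketch; the same idea applies when the view is handed to a Cython function typed as an int64 memoryview:

    import numpy as np

    dates = np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")

    # datetime64 cannot be exported through the buffer protocol, but an
    # int64 view of the same memory can, so that view is what gets passed
    # to Cython.
    int_view = dates.view("int64")
    memoryview(int_view)  # succeeds: plain 64-bit integers

    # The caller can reinterpret the result back to datetimes afterwards.
    roundtrip = int_view.view("datetime64[ns]")
    assert (roundtrip == dates).all()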