On 6/7/23 20:44, Evgeni Burovski wrote:
On Thu, Jul 6, 2023 at 7:56 PM Nathan <nathan.goldb...@gmail.com> wrote:
>
> Hi all,
>
> As you may know, I'm currently working on a variable-width string
dtype using the new experimental user dtype API. As part of this work
I'm running into papercuts that future dtype authors will likely hit
and I've been trying to fix them as I go.
>
> One issue I'd like to raise with the list is that the Python buffer
protocol and the `__array_interface__` protocol support a limited set
of data types.
>
> This leads to three concrete issues I'm working around:
>
> * The `npy` file format uses the type strings defined by the
`__array_interface__` protocol, so any type that doesn't have a type
string defined in that protocol cannot currently be saved [1].
>
> * Cython uses the buffer protocol in its support for numpy
arrays and in the typed memoryview interface, so any array
with a dtype that doesn't support the buffer protocol cannot be
accessed using idiomatic Cython code [2]. The same issue means Cython
can't easily support float16 or datetime dtypes [3].
>
> * Currently new dtypes don't have a way to export a string
version of themselves that numpy can subsequently load (implicitly
importing the dtype). This makes it more awkward to update downstream
libraries that currently treat dtypes as strings.
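
To make the "limited set of data types" point concrete, here is a minimal sketch of how a basic `__array_interface__` type string is put together: a byte-order character, a one-letter kind code, and an item size in bytes. The helper name `describe_typestr` and the `KINDS` table are hypothetical and written only for illustration; the kind letters are the ones listed in the array interface documentation, and the sketch ignores suffixes such as datetime units. It uses only the standard library, not NumPy's own parser.

```python
import re

# Kind characters listed in the __array_interface__ documentation.
# A dtype with no matching kind has no type string to write out.
KINDS = {
    "t": "bit field",
    "b": "boolean",
    "i": "signed integer",
    "u": "unsigned integer",
    "f": "floating point",
    "c": "complex floating point",
    "m": "timedelta",
    "M": "datetime",
    "O": "Python object",
    "S": "byte string",
    "U": "unicode string",
    "V": "other (void)",
}

# Basic form only: byte order, kind letter, item size in bytes.
_TYPESTR = re.compile(r"^([<>|])([A-Za-z])(\d+)$")


def describe_typestr(typestr):
    """Split a basic typestr such as '<f8' into (byteorder, kind, itemsize)."""
    match = _TYPESTR.match(typestr)
    if match is None or match.group(2) not in KINDS:
        raise ValueError(f"not a recognized typestr: {typestr!r}")
    byteorder, kind, size = match.groups()
    return byteorder, KINDS[kind], int(size)


print(describe_typestr("<f8"))
print(describe_typestr("|S10"))
```

A string that does not fit this grammar, such as a hypothetical `"StringDType()"`, is rejected with `ValueError`, which mirrors the problem described above: there is no slot in the scheme for a new dtype to describe itself.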
>
> One way to fix this is to define an ad-hoc extension to the buffer
protocol. Officially, the buffer protocol only supports the format
codes used in the struct module [4]. Unofficially, memoryview doesn't
raise a NotImplementedError if you pass it an invalid format code,
only raising an error when it tries to access the data. This means we
can stuff an arbitrary string into the format code. See the proposal
from Sebastian on the Python Discuss forum [5] and his
proof-of-concept [6]. The hardest issue with this approach is that
it's a social problem, requiring cross-project coordination with at
least Cython, and possibly a PEP to standardize whatever extension to
the buffer protocol we come up with.
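
The memoryview behavior described above can be observed with the standard library alone. The sketch below uses a `ctypes` array of structures as a stand-in exporter, because `ctypes` advertises a format string that is not a single struct-module code; this is an assumption made for illustration and is not Sebastian's proof-of-concept or anything NumPy does. The names `Pair` and `try_read_first` are hypothetical.

```python
import ctypes


class Pair(ctypes.Structure):
    # Two 32-bit integers; ctypes exports this with a multi-character
    # struct-description format rather than a single format code.
    _fields_ = [("a", ctypes.c_int32), ("b", ctypes.c_int32)]


pairs = (Pair * 3)()

# Creating the memoryview succeeds even though the format is not one
# that memoryview knows how to unpack.
view = memoryview(pairs)
print("format:", view.format)


def try_read_first(v):
    """Try to read one element; report the error instead of raising."""
    try:
        v[0]
    except NotImplementedError as exc:
        return f"NotImplementedError: {exc}"
    return "ok"


# The error only appears once an element is actually accessed.
result = try_read_first(view)
print(result)

# The raw bytes are still reachable, so a consumer that understands the
# format string could interpret the data itself.
raw = view.tobytes()
print(len(raw), "bytes")
```

The design point is that the format field is effectively an opaque string as far as memoryview is concerned, so a producer and consumer that agree on an extended syntax can exchange data through it without memoryview itself having to understand that syntax.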
>
> Another option would be to exchange data using the arrow data format
[7], which already supports many of the kinds of memory layouts custom
dtype authors might want to use and supports defining custom data
types [8]. The big issue here is that NumPy probably can't depend on
the arrow C++ library (I think?) so we would need to write a bunch of
code to support arrow data layouts and data types, but then we would
also need to do the same thing on the Cython side.
>
> Implementing either of these approaches fixes the issues I
enumerated above at the cost of some added complexity. We don't
necessarily have to make an immediate decision for my work to be
viable, since I can work around most of these issues, but I think now is
probably the time to raise this and see if anyone has
strong opinions about what NumPy should ultimately do.
>
> I've raised this on the Cython mailing list to get their take as
well [9].
>
> [1] https://github.com/numpy/numpy/issues/24110
> [2] https://github.com/numpy/numpy/issues/18442
> [3] https://github.com/numpy/numpy/issues/4983
> [4] https://docs.python.org/3/library/struct.html#format-strings
> [5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
> [6] https://github.com/numpy/numpy/issues/23500#issuecomment-1525103546
> [7] https://arrow.apache.org/docs/format/Columnar.html
> [8] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> [9] https://mail.python.org/pipermail/cython-devel/2023-July/005434.html

I wonder if the dlpack protocol can be helpful for these kinds of dtypes?

No. DLPack has an enum for a fixed number of known dtypes [0], and
adding new ones is non-trivial.

[0] https://github.com/dmlc/dlpack/blob/ca4d00ad3e2e0f410eeab3264d21b8a39397f362/include/dlpack/dlpack.h#L158

Matti