[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Kevin Sheppard Wed, 20 Sep 2023 03:41:44 -0700

On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gomm...@gmail.com>
wrote:


>
>
> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <
> warren.weckes...@gmail.com> wrote:
>
>>
>>
>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <
>> warren.weckes...@gmail.com> wrote:
>> >
>> >
>> >
>> > On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldb...@gmail.com>
>> wrote:
>> >>
>> >>
>> >>
>> >> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <
>> warren.weckes...@gmail.com> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldb...@gmail.com>
>> wrote:
>> >>> >
>> >>> > The NEP was merged in draft form, see below.
>> >>> >
>> >>> > https://numpy.org/neps/nep-0055-string_dtype.html
>> >>> >
>> >>> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com>
>> wrote:
>> >>> >>
>> >>> >> Hello all,
>> >>> >>
>> >>> >> I just opened a pull request to add NEP 55, see
>> https://github.com/numpy/numpy/pull/24483.
>> >>> >>
>> >>> >> Per NEP 0, I've copied everything up to the "detailed description"
>> section below.
>> >>> >>
>> >>> >> I'm looking forward to your feedback on this.
>> >>> >>
>> >>> >> -Nathan Goldbaum
>> >>> >>
>> >>>
>> >>> This will be a nice addition to NumPy, and matches a suggestion by
>> >>> @rkern (and probably others) made in the 2017 mailing list thread;
>> >>> see the last bullet of
>> >>>
>> >>>
>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>> >>>
>> >>> So +1 for the enhancement!
>> >>>
>> >>> Now for some nitty-gritty review...
>> >>
>> >>
>> >> Thanks for the nitty-gritty review! I was on vacation last week and
>> haven't had a chance to look over this in detail yet, but at first glance
>> this seems like a really nice improvement.
>> >>
>> >> I'm going to try to integrate your proposed design into the dtype
>> prototype this week. If that works, I'd like to include some of the text
>> from the README in your repo in the NEP and add you as an author, would
>> that be alright?
>> >
>> >
>> >
>> > Sure, that would be fine.
>> >
>> > I have a few more comments and questions about the NEP that I'll finish
>> up and send this weekend.
>> >
>>
>> One more comment on the NEP...
>>
>> My first impression of the missing data API design is that
>> it is more complicated than necessary. An alternative that
>> is simpler--and is consistent with the pattern established for
>> floats and datetimes--is to define a "not a string" value, say
>> `np.nastring` or something similar, just like we have `nan` for
>> floats and `nat` for datetimes. Its behavior could be what
>> you called "nan-like".
>>
>
> Float `np.nan` and the datetime missing-value sentinel are not all that
> similar, and the latter was always a bit questionable (at least partially
> it's a left-over of trying to introduce generic missing value support I
> believe). `nan` is a float and part of C/C++ standards with well-defined
> numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
> sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
> possible to generate a NaT value with a regular operation on a datetime
> array a la `np.array([1.5]) / 0.0`.
>
>> The handling of `np.nastring` would be an intrinsic part of the
>> dtype, so there would be no need for the `na_object` parameter
>> of `StringDType`. All `StringDType`s would handle `np.nastring`
>> in the same consistent manner.
>>
>> The use-case for the string sentinel does not seem very
>> compelling (but maybe I just don't understand the use-cases).
>> If there is a real need here that is not covered by
>> `np.nastring`, perhaps just a flag to control the repr of
>> `np.nastring` for each StringDType instance would be enough?
>>
>
> My understanding is that the NEP provides the necessary but limited
> support to allow Pandas to adopt the new dtype. The scope section of the
> NEP says: "Fully agreeing on the semantics of a missing data sentinels or
> adding a missing data sentinel to NumPy itself.". And then further down:
> "By only supporting user-provided missing data sentinels, we avoid
> resolving exactly how NumPy itself should support missing data and the
> correct semantics of the missing data object, leaving that up to users to
> decide"
>
> That general approach I agree with, it's a large can of worms and not the
> main purpose of this NEP. Nathan may have more thoughts about what, if
> anything, from your suggestions could be adopted, but the general "let's
> introduce a missing value thing" is a path we should not go down here imho.
>
>
>
>>
>> If there is an objection to a potential proliferation of
>> "not a thing" special values, one for each type that can
>> handle them, then perhaps a generic "not a value" (say
>> `np.navalue`) could be created that, when assigned to an
>> element of an array, results in the appropriate "not a thing"
>> value actually being assigned. In a sense, I guess this NEP is
>> proposing that, but it is reusing the floating point object
>> `np.nan` as the generic "not a thing" value
>>
>
> It is explicitly not using `np.nan` but instead allowing the user to
> provide their preferred sentinel. You're probably referring to the example
> with `na_object=np.nan`, but that example would work with another sentinel
> value too.
>
> Cheers,
> Ralf
>
>
>
>> , and my preference
>> is that, *if* we go with such a generic object, it is not
>> the floating point value `nan` but a new thing with a name
>> that reflects its purpose. (I guess Pandas users might be
>> accustomed to `nan` being a generic sentinel for missing data,
>> so its use doesn't feel as incohesive as it might to others.
>> Passing a string array to `np.isnan()` just feels *wrong* to
>> me.)
>>
>> Anyway, that's my 2¢.
>>
>> Warren
>>
>>
>>
>
I was a bit surprised that len is not used to encode missing values.
The NEP's proposal that len == 0 means an empty string, unless a sentinel
is set, in which case it means a missing value, feels pretty limiting,
since these are distinctly different things.

Would it make sense for len<0 to indicate a missing value?  This would
require using ssize_t instead of size_t, and would then halve the maximum
string size. In principle this would allow for as many distinct missing
values as there are negative ssize_t values (half the range of the type).
I think ssize_t is well-defined on all platforms targeted by NumPy.
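
To make the idea concrete, here is a minimal sketch in Python. It assumes
a string element whose length field is a signed integer; it is not the
layout the NEP actually uses. The codes NA and NOT_APPLICABLE and the
function classify_length are hypothetical names used only for
illustration.

```python
import ctypes

# Hypothetical missing-value codes (illustrative names, not part of NumPy).
NA = -1
NOT_APPLICABLE = -2

# Width of the platform's ssize_t in bits.
SSIZE_BITS = 8 * ctypes.sizeof(ctypes.c_ssize_t)

# Largest string length a signed length field can hold.
MAX_LEN = 2 ** (SSIZE_BITS - 1) - 1

# Every negative value is available as a distinct missing-value code.
N_MISSING_CODES = 2 ** (SSIZE_BITS - 1)


def classify_length(length):
    """Interpret a signed length field under the len < 0 convention."""
    if length < 0:
        return "missing"   # any negative length marks a missing value
    if length == 0:
        return "empty"     # zero length stays an ordinary empty string
    return "string"


for value in (NA, NOT_APPLICABLE, 0, 5):
    print(value, classify_length(value))
```

With this convention an empty string and a missing value never share an
encoding, at the cost of one bit of the length range.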

Kevin

