On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>
> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>
>>
>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>> >
>> > On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldb...@gmail.com> wrote:
>> >>
>> >> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>> >>>
>> >>> On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldb...@gmail.com> wrote:
>> >>> >
>> >>> > The NEP was merged in draft form, see below.
>> >>> >
>> >>> > https://numpy.org/neps/nep-0055-string_dtype.html
>> >>> >
>> >>> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com> wrote:
>> >>> >>
>> >>> >> Hello all,
>> >>> >>
>> >>> >> I just opened a pull request to add NEP 55, see
>> >>> >> https://github.com/numpy/numpy/pull/24483.
>> >>> >>
>> >>> >> Per NEP 0, I've copied everything up to the "detailed description"
>> >>> >> section below.
>> >>> >>
>> >>> >> I'm looking forward to your feedback on this.
>> >>> >>
>> >>> >> -Nathan Goldbaum
>> >>> >>
>> >>>
>> >>> This will be a nice addition to NumPy, and matches a suggestion by
>> >>> @rkern (and probably others) made in the 2017 mailing list thread;
>> >>> see the last bullet of
>> >>>
>> >>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>> >>>
>> >>> So +1 for the enhancement!
>> >>>
>> >>> Now for some nitty-gritty review...
>> >>
>> >> Thanks for the nitty-gritty review! I was on vacation last week and
>> >> haven't had a chance to look over this in detail yet, but at first glance
>> >> this seems like a really nice improvement.
>> >>
>> >> I'm going to try to integrate your proposed design into the dtype
>> >> prototype this week. If that works, I'd like to include some of the text
>> >> from the README in your repo in the NEP and add you as an author, would
>> >> that be alright?
>> >
>> > Sure, that would be fine.
>> >
>> > I have a few more comments and questions about the NEP that I'll finish
>> > up and send this weekend.
>> >
>>
>> One more comment on the NEP...
>>
>> My first impression of the missing data API design is that
>> it is more complicated than necessary. An alternative that
>> is simpler--and is consistent with the pattern established for
>> floats and datetimes--is to define a "not a string" value, say
>> `np.nastring` or something similar, just like we have `nan` for
>> floats and `nat` for datetimes. Its behavior could be what
>> you called "nan-like".
>>
>
> Float `np.nan` and datetime missing value sentinel are not all that
> similar, and the latter was always a bit questionable (at least partially
> it's a left-over of trying to introduce generic missing value support I
> believe). `nan` is a float and part of C/C++ standards with well-defined
> numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
> sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
> possible to generate a NaT value with a regular operation on a datetime
> array a la `np.array([1.5]) / 0.0`.
>
>> The handling of `np.nastring` would be an intrinsic part of the
>> dtype, so there would be no need for the `na_object` parameter
>> of `StringDType`. All `StringDType`s would handle `np.nastring`
>> in the same consistent manner.
>>
>> The use-case for the string sentinel does not seem very
>> compelling (but maybe I just don't understand the use-cases).
>> If there is a real need here that is not covered by
>> `np.nastring`, perhaps just a flag to control the repr of
>> `np.nastring` for each StringDType instance would be enough?
>>
>
> My understanding is that the NEP provides the necessary but limited
> support to allow Pandas to adopt the new dtype. The scope section of the
> NEP says: "Fully agreeing on the semantics of a missing data sentinels or
> adding a missing data sentinel to NumPy itself.". And then further down:
> "By only supporting user-provided missing data sentinels, we avoid
> resolving exactly how NumPy itself should support missing data and the
> correct semantics of the missing data object, leaving that up to users to
> decide"
>
> That general approach I agree with, it's a large can of worms and not the
> main purpose of this NEP. Nathan may have more thoughts about what, if
> anything, from your suggestions could be adopted, but the general "let's
> introduce a missing value thing" is a path we should not go down here imho.
>
>>
>> If there is an objection to a potential proliferation of
>> "not a thing" special values, one for each type that can
>> handle them, then perhaps a generic "not a value" (say
>> `np.navalue`) could be created that, when assigned to an
>> element of an array, results in the appropriate "not a thing"
>> value actually being assigned. In a sense, I guess this NEP is
>> proposing that, but it is reusing the floating point object
>> `np.nan` as the generic "not a thing" value
>>
>
> It is explicitly not using `np.nan` but instead allowing the user to
> provide their preferred sentinel. You're probably referring to the example
> with `na_object=np.nan`, but that example would work with another sentinel
> value too.
>
> Cheers,
> Ralf
>
>> , and my preference
>> is that, *if* we go with such a generic object, it is not
>> the floating point value `nan` but a new thing with a name
>> that reflects its purpose. (I guess Pandas users might be
>> accustomed to `nan` being a generic sentinel for missing data,
>> so its use doesn't feel as incohesive as it might to others.
>> Passing a string array to `np.isnan()` just feels *wrong* to
>> me.)
>>
>> Any, that's my 2¢.
>>
>> Warren
>>
>

I was a bit surprised that len was not used as part of the missing value.
The NEP proposal that a length of 0 is an empty string unless there is a sentinel, in which case it is a missing value, feels pretty limiting, since these are distinctly different things. Would it make sense for len<0 to indicate a missing value? This would require using ssize_t instead of size_t, and would then halve the maximum string size. In principle this would allow roughly half of the ssize_t range to encode distinct missing values. I think ssize_t is well-defined on all platforms targeted by NumPy.

Kevin
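
To make the len<0 idea concrete, here is a minimal Python sketch of a signed length field. It is only an illustration of the suggestion above, not what the NEP or NumPy implements: the names `encode_length`, `describe`, and `missing_kind` are made up for this example, and `ctypes.c_ssize_t` is used only to read the platform's ssize_t width. The assumption is that a non-negative field is a real string length (0 being the empty string) and a negative field marks a missing value, with its magnitude saying which kind.

```python
import ctypes

# Width of the platform's ssize_t in bits (commonly 64).
SSIZE_BITS = 8 * ctypes.sizeof(ctypes.c_ssize_t)
SSIZE_MAX = 2 ** (SSIZE_BITS - 1) - 1   # largest storable string length
SSIZE_MIN = -(2 ** (SSIZE_BITS - 1))    # most negative field value


def encode_length(value, missing_kind=1):
    """Return the signed length field for `value`.

    `value` is a bytes object, or None for a missing entry.
    `missing_kind` is a hypothetical positive code that tells different
    kinds of missing value apart; it is stored negated.
    """
    if value is None:
        if not 1 <= missing_kind <= -SSIZE_MIN:
            raise ValueError("missing_kind does not fit in ssize_t")
        return -missing_kind
    if len(value) > SSIZE_MAX:
        raise OverflowError("string too long for a signed length field")
    return len(value)


def describe(length_field):
    """Classify a signed length field as missing, empty, or a string."""
    if length_field < 0:
        return ("missing", -length_field)
    if length_field == 0:
        return ("empty", 0)
    return ("string", length_field)


for item in (b"", b"numpy", None):
    print(item, describe(encode_length(item)))
```

With this layout the empty string and a missing value never share an encoding, at the cost of one bit of the length range.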