On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldb...@gmail.com> wrote:
> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>
>> On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldb...@gmail.com> wrote:
>> >
>> > The NEP was merged in draft form, see below.
>> >
>> > https://numpy.org/neps/nep-0055-string_dtype.html
>> >
>> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com> wrote:
>> >>
>> >> Hello all,
>> >>
>> >> I just opened a pull request to add NEP 55, see
>> >> https://github.com/numpy/numpy/pull/24483.
>> >>
>> >> Per NEP 0, I've copied everything up to the "detailed description"
>> >> section below.
>> >>
>> >> I'm looking forward to your feedback on this.
>> >>
>> >> -Nathan Goldbaum
>>
>> This will be a nice addition to NumPy, and matches a suggestion by
>> @rkern (and probably others) made in the 2017 mailing list thread;
>> see the last bullet of
>>
>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>
>> So +1 for the enhancement!
>>
>> Now for some nitty-gritty review...
>
> Thanks for the nitty-gritty review! I was on vacation last week and
> haven't had a chance to look over this in detail yet, but at first glance
> this seems like a really nice improvement.
>
> I'm going to try to integrate your proposed design into the dtype
> prototype this week. If that works, I'd like to include some of the text
> from the README in your repo in the NEP and add you as an author, would
> that be alright?

Sure, that would be fine. I have a few more comments and questions about
the NEP that I'll finish up and send this weekend.

Warren

>> There is a design change that I think should be made in the
>> implementation of missing values.
>>
>> In the current design described in the NEP, and expanded on in the
>> comment
>>
>> https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
>>
>> the meaning of the values `{len = 0, buf = NULL}` in an instance of
>> `npy_static_string` depends on whether or not the `na_object` has been
>> set in the dtype. If it has not been set, that data represents a string
>> of length 0. If `na_object` *has* been set, that data represents a
>> missing value. To get a string of length 0 in this case, some non-NULL
>> value must be assigned to the `buf` field. (In the comment linked
>> above, @ngoldbaum suggested `{0, "\0"}`, but strings are not
>> NUL-terminated, so there is no need for that `\0` in `buf`, and in fact,
>> with `len == 0`, it would be a bug for the pointer to be dereferenced,
>> so *any* non-NULL value--valid pointer or not--could be used for `buf`.)
>>
>> I think it would be better if `len == 0` *always* meant a string with
>> length 0, with no additional qualifications; it shouldn't be necessary
>> to put some non-NULL value in `buf` just to get an empty string. We
>> can achieve this if we use a bit in `len` as a flag for a missing value.
>> Reserving a bit from `len` as a flag reduces the maximum possible string
>> length, but as discussed in the NEP pull request, we're almost certainly
>> going to reserve at least the high bit of `len` when small string
>> optimization (SSO) is implemented. This will reduce the maximum string
>> length to `2**(N-1)-1`, where `N` is the bit width of `size_t`
>> (equivalent to using a signed type for `len`). Even if SSO isn't
>> implemented immediately, we can anticipate the need for flags stored
>> in `len`, and use them to implement missing values.
>>
>> The actual implementation of SSO will require some more design work,
>> because the offset of the most significant byte of `len` within the
>> `npy_static_string` struct depends on the platform endianness.
>> For little-endian, the most significant byte is not the first byte in the
>> struct, so the bytes available for SSO within the struct are not
>> contiguous when the fields have the order `{len, buf}`.
>>
>> I experimented with these ideas, and put the result at
>>
>> https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring
>>
>> The idea that I propose there is to make the memory layout of the
>> struct depend on the endianness of the platform, so the most
>> significant byte of `len` (which I called `size`, to avoid any chance
>> of confusion with the actual length of the string [1]) is at the
>> beginning of the struct on big-endian platforms and at the end of the
>> struct for little-endian platforms. More details are included in the
>> file README.md. Note that I am not suggesting that all the SSO stuff
>> be included in the current NEP! This is just a proof-of-concept that
>> shows one possibility for SSO.
>>
>> In that design, the high bit of `size` (which is `len` here) being set
>> indicates that the `npy_static_string` struct should not be interpreted
>> as the standard `{len, buf}` representation of a string. When the
>> second highest bit is set, it means we have a missing value. If the
>> second highest bit is not set, SSO is active; see the link above for
>> more details.
>>
>> With this design, `len == 0` *always* means a string of length 0,
>> regardless of whether or not `na_object` is defined in the dtype.
>>
>> Also with this design, an array created with `calloc()` will
>> automatically be an array of empty strings. With the current design in
>> the NEP, an array created with `calloc()` will be either an array of
>> empty strings, or an array of missing values, depending on whether or
>> not the dtype has `na_object` defined. That conditional behavior
>> seems less than desirable.
>>
>> What do you think?
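
For illustration only, here is a minimal C sketch of the flag-bit idea
described above. Every name in it (`sketch_string`, `SKETCH_NOT_PLAIN`,
`SKETCH_MISSING`, and the helper functions) is hypothetical and is not taken
from the NEP or from the linked repository. It assumes a plain `{size, buf}`
field order and leaves out SSO and the endianness-dependent layout entirely;
it only shows how a missing value might be told apart from an empty string.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical struct for illustration; field order and names are assumptions. */
typedef struct {
    size_t size;      /* string length; the two highest bits are reserved as flags */
    const char *buf;  /* must not be dereferenced when size == 0 */
} sketch_string;

#define SKETCH_BITS      (sizeof(size_t) * CHAR_BIT)
/* Highest bit set: not a plain {size, buf} string. */
#define SKETCH_NOT_PLAIN ((size_t)1 << (SKETCH_BITS - 1))
/* Second-highest bit set (together with the highest bit): missing value. */
#define SKETCH_MISSING   ((size_t)1 << (SKETCH_BITS - 2))
/* Largest length of a plain string: 2**(N-1) - 1. */
#define SKETCH_MAX_SIZE  (SKETCH_NOT_PLAIN - 1)

/* An empty string is always size == 0, whatever buf holds. */
static int sketch_is_empty(const sketch_string *s) {
    return s->size == 0;
}

/* A missing value needs both flag bits set. */
static int sketch_is_missing(const sketch_string *s) {
    return (s->size & SKETCH_NOT_PLAIN) != 0 && (s->size & SKETCH_MISSING) != 0;
}

/* Mark an element as missing; buf carries no meaning in that state. */
static void sketch_set_missing(sketch_string *s) {
    assert(s != NULL);
    s->size = SKETCH_NOT_PLAIN | SKETCH_MISSING;
    s->buf = NULL;
}

/* calloc() zero-fills, so every element starts with size == 0: an empty string. */
static sketch_string *sketch_new_array(size_t n) {
    return calloc(n, sizeof(sketch_string));
}

/* Count the missing values in an array of n elements. */
static size_t sketch_count_missing(const sketch_string *arr, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_is_missing(&arr[i])) {
            count++;
        }
    }
    return count;
}
```

With these helpers, a freshly allocated array from `sketch_new_array()`
contains only empty strings and no missing values, and that holds whether or
not a dtype defines an `na_object`, because the check never looks at `buf`.
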
>>
>> --Warren
>>
>> [1] I would like to see `len` renamed to `size` in the
>> `npy_static_string` struct, but that's bikeshed stuff, and not
>> a blocker.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: nathan12...@gmail.com