I just opened a draft PR to include stringdtype in numpy:
https://github.com/numpy/numpy/pull/25347

If you are interested in testing the new dtype but haven't had the chance
yet, this should hopefully make it easier. From a clone of the NumPy repo,
running:

$ git fetch https://github.com/ngoldbaum/numpy stringdtype:stringdtype
$ git checkout stringdtype
$ git submodule update --init
$ python -m pip install .

should build and install a version of NumPy that includes stringdtype,
importable as `np.dtypes.StringDType`. Note that this is based on NumPy 2.0
dev, so if you need to use another package that depends on NumPy's ABI to
test the dtype, you'll need to rebuild that package as well.
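
To check that the build picked up the new dtype, something like the
following should work. This is only a sketch: the helper name
`stringdtype_available` and the sample strings are made up for
illustration, and it assumes the dtype can be constructed with no
arguments as `np.dtypes.StringDType()`.

```python
import importlib.util


def stringdtype_available():
    """Return True if the installed NumPy exposes np.dtypes.StringDType."""
    # Hypothetical helper, not part of NumPy.
    if importlib.util.find_spec("numpy") is None:
        return False
    import numpy as np

    return hasattr(getattr(np, "dtypes", None), "StringDType")


# Sample data mixing short, long, and empty strings.
words = ["a", "a much longer string than the first one", ""]

if stringdtype_available():
    import numpy as np

    # Assumes the default constructor takes no required arguments.
    arr = np.array(words, dtype=np.dtypes.StringDType())
    print(np.__version__, arr.dtype)
    print(arr)
else:
    print("np.dtypes.StringDType not found; is the stringdtype branch installed?")
```

If the fallback message prints, the interpreter is most likely importing a
different NumPy than the one you just built.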

I'll be continuing to work on this PR to finish integrating stringdtype
into NumPy and write documentation.

If anyone has any feedback on any aspect of the NEP or the stringdtype code,
please reply here, on GitHub, or reach out to me privately.

On Wed, Nov 22, 2023 at 1:22 PM Nathan <nathan.goldb...@gmail.com> wrote:

> Hi all,
>
> This week I updated NEP 55 to reflect the changes I made to the prototype
> since I initially sent out the NEP. The updated NEP is available on the
> NumPy website: https://numpy.org/neps/nep-0055-string_dtype.html.
>
> Updates to the NEP
> ++++++++++++++++++
>
> The changes since the original version of the NEP focus on fully defining
> the C API surface we would like to add to the NumPy C API and an
> implementation of a per-dtype-instance arena allocator to manage heap
> allocations. This enabled major improvements to the prototype, including
> implementing the small string optimization and locking all access to heap
> memory behind a fine-grained mutex, which should prevent segfaults or
> memory corruption in a multithreaded context. Thanks to Warren Weckesser
> for his proof-of-concept code and help with the small string optimization
> implementation; he has been added as an author to reflect his
> contributions.
>
> With these changes the stringdtype prototype is feature complete.
>
> Call to Review NEP 55
> +++++++++++++++++++++
>
> I'm requesting another round of review on the NEP with an eye toward
> acceptance before the NumPy 2.0 release branch is created from main. If I
> can manage it, my plan is to have a pull request open that merges the
> stringdtype codebase into NumPy before the branch is created. That said,
> if we decide that we need more time, or if some issue comes up, I'm happy
> with this going into main after the NumPy 2.0 release branch is created.
>
> The most significant feedback we have not addressed from the last round of
> review was Warren's suggestion to add a default missing data sentinel to
> NumPy itself. For reasons outlined in the NEP and in my reply to Warren
> from earlier this year, we do not want to add a missing data singleton to
> NumPy, instead leaving it to users to choose the missing data semantics
> they prefer. Otherwise I believe the current draft addresses all
> outstanding feedback from the last round of review.
>
> Help me Test the Prototype!
> +++++++++++++++++++++++++++
>
> If anyone has time and interest, I would also very much appreciate some
> testing and tire-kicking on the stringdtype prototype, available at
> https://github.com/numpy/numpy-user-dtypes.
>
> There is a README with build instructions here:
> https://github.com/numpy/numpy-user-dtypes/blob/main/stringdtype/README.md
>
> If you have a Python development environment with a C compiler, it should
> be straightforward to build, install, and test the prototype. Note that
> you must have `NUMPY_EXPERIMENTAL_DTYPE_API=1` set in your shell
> environment or via `os.environ` to import stringdtype without error.
>
> I'm particularly interested to hear experiences converting code to use
> stringdtype. This could be code using fixed-width strings in a situation
> where a variable-length string array makes more sense or code using object
> string arrays. Are there pain points that aren't discussed in the NEP or
> existing workflows that cannot be adapted to use stringdtype? As far as
> I'm aware there aren't, but more testing will help catch issues before
> we've stabilized everything.
>
> My fork of pandas might be a source of inspiration for porting an existing
> non-trivial codebase that used object string arrays:
>
> https://github.com/pandas-dev/pandas/compare/main...ngoldbaum:pandas:stringdtype
>
> Thanks all for your time, attention, and help reviewing the NEP!
>
> -Nathan
>
>
>