On Tue, Aug 29, 2023 at 4:08 PM Nathan <nathan.goldb...@gmail.com> wrote:

> The NEP was merged in draft form, see below.
>
> https://numpy.org/neps/nep-0055-string_dtype.html
>

This is a really nice NEP, thanks Nathan! I see that questions and
constructive feedback are still coming in on GitHub, but for now it seems
like everyone is pretty happy with moving forward with implementing this
new dtype in NumPy.

Cheers,
Ralf




>
> On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com> wrote:
>
>> Hello all,
>>
>> I just opened a pull request to add NEP 55, see
>> https://github.com/numpy/numpy/pull/24483.
>>
>> Per NEP 0, I've copied everything up to the "detailed description"
>> section below.
>>
>> I'm looking forward to your feedback on this.
>>
>> -Nathan Goldbaum
>>
>> =========================================================
>> NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
>> =========================================================
>>
>> :Author: Nathan Goldbaum <ngoldb...@quansight.com>
>> :Status: Draft
>> :Type: Standards Track
>> :Created: 2023-06-29
>>
>>
>> Abstract
>> --------
>>
>> We propose adding a new string data type to NumPy where each item in the
>> array is an arbitrary-length UTF-8 encoded string. This will enable
>> performance, memory usage, and usability improvements for NumPy users,
>> including:
>>
>> * Memory savings for workflows that currently use fixed-width strings and
>> store primarily ASCII data or a mix of short and long strings in a single
>> NumPy array.
>>
>> * The ability for downstream libraries and users to move away from object
>> arrays currently used as a substitute for variable-length string arrays,
>> unlocking performance improvements by avoiding passes over the data
>> outside of NumPy.
>>
>> * A more intuitive user-facing API for working with arrays of Python
>> strings, without a need to think about the in-memory array representation.
>>
>> Motivation and Scope
>> --------------------
>>
>> First, we will describe how the current state of support for string or
>> string-like data in NumPy arose. Next, we will summarize the last major
>> previous
>> discussion about this topic. Finally, we will describe the scope of the
>> proposed
>> changes to NumPy as well as changes that are explicitly out of scope of
>> this
>> proposal.
>>
>> History of String Support in NumPy
>> **********************************
>>
>> Support in NumPy for textual data evolved organically in response to
>> early user
>> needs and then changes in the Python ecosystem.
>>
>> Support for strings was added to NumPy to accommodate users of the NumArray
>> ``chararray`` type. Remnants of this are still visible in the NumPy API:
>> string-related functionality lives in ``np.char``, to support the
>> obsolete
>> ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of
>> string
>> DTypes.
>>
>> NumPy's ``bytes_`` DType was originally used to represent the Python 2
>> ``str`` type before Python 3 support was added to NumPy. The bytes DType makes
>> the most
>> sense when it is used to represent Python 2 strings or other
>> null-terminated
>> byte sequences. However, ignoring data after the first null character
>> means the
>> ``bytes_`` DType is only suitable for bytestreams that do not contain
>> nulls, so
>> it is a poor match for generic bytestreams.
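>>
>> As a quick illustration of the null-stripping behavior described above
>> (run against current NumPy; not part of the NEP text itself):

```python
import numpy as np

# Store a null-padded byte string in a fixed-width "S" element.
arr = np.array([b"abc\x00\x00"], dtype="S5")

# The trailing null padding is stripped when the element is read back,
# so null-padded byte streams do not round-trip losslessly.
assert arr[0] == b"abc"

# The storage slot itself is still the full fixed width.
assert arr.dtype.itemsize == 5
```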
>>
>> The ``unicode`` DType was added to support the Python 2 ``unicode``
>> type. It
>> stores data in 32-bit UCS-4 codepoints (i.e., a UTF-32 encoding), which
>> makes for
>> a straightforward implementation, but is inefficient for storing text
>> that can
>> be represented well using a one-byte ASCII or Latin-1 encoding. This was
>> not a
>> problem in Python 2, where ASCII or mostly-ASCII text could use the
>> Python 2
>> ``str`` DType (the current ``bytes_`` DType).
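>>
>> The four-bytes-per-codepoint cost is easy to see with current NumPy (a
>> quick illustration, not part of the NEP text):

```python
import numpy as np

# Fixed-width unicode arrays store 4 bytes (UCS-4) per code point.
arr = np.array(["hello"])
assert arr.dtype.kind == "U"
assert arr.dtype.itemsize == 20  # 5 characters * 4 bytes each

# The same ASCII text needs only 5 bytes in a UTF-8 or Latin-1 encoding.
assert len("hello".encode("utf-8")) == 5
```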
>>
>> With the arrival of Python 3 support in NumPy, the string DTypes were
>> largely
>> left alone due to backward compatibility concerns, although the unicode
>> DType
>> became the default DType for ``str`` data and the old ``string`` DType
>> was
>> renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal
>> situation of shipping a data type originally intended for null-terminated
>> bytestrings as the data type for *all* Python ``bytes`` data, and a
>> default
>> string type with an in-memory representation that consumes four times as
>> much
>> memory as needed for ASCII or mostly-ASCII data.
>>
>> Problems with Fixed-Width Strings
>> *********************************
>>
>> Both existing string DTypes represent fixed-width sequences, allowing
>> storage of
>> the string data in the array buffer. This avoids adding out-of-band storage
>> to NumPy; however, it makes for an awkward user interface. In particular, the
>> maximum string size must be inferred by NumPy or estimated by the user
>> before
>> loading the data into a NumPy array or selecting an output DType for
>> string
>> operations. In the worst case, this requires an expensive pass over the
>> full
>> dataset to calculate the maximum length of an array element. It also
>> wastes
>> memory when array elements have varying lengths. Pathological cases where
>> an
>> array stores many short strings and a few very long strings are
>> particularly bad
>> for wasting memory.
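>>
>> The pathological case is easy to reproduce with current NumPy (an
>> illustration, not part of the NEP text):

```python
import numpy as np

# One long string forces every element up to the maximum width.
data = ["x" * 1000] + ["short"] * 999
arr = np.array(data)

assert arr.dtype.itemsize == 4000         # '<U1000': 1000 chars * 4 bytes
assert arr.nbytes == 4000 * len(data)     # ~4 MB of array buffer...
assert sum(len(s) for s in data) == 5995  # ...for ~6 KB of actual text
```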
>>
>> Downstream usage of string data in NumPy arrays has demonstrated the need
>> for a
>> variable-width string data type. In practice, most downstream users employ
>> ``object`` arrays for this purpose. In particular, ``pandas`` has
>> explicitly
>> deprecated support for NumPy fixed-width strings, coerces NumPy
>> fixed-width
>> string arrays to ``object`` arrays, and in the future may switch to only
>> supporting string data via ``PyArrow``, which has native support for
>> UTF-8
>> encoded variable-width string arrays [1]_. This is unfortunate, since
>> ``object`` arrays have no type guarantees, necessitating expensive
>> sanitization passes, and operations using object arrays cannot release the
>> GIL.
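>>
>> The missing type guarantee is easy to demonstrate (an illustration with
>> current NumPy, not part of the NEP text):

```python
import numpy as np

# Nothing prevents non-string elements from entering an object array, so
# consumers like pandas must scan and sanitize before treating the array
# as string data.
arr = np.array(["spam", "eggs", 42], dtype=object)
assert not all(isinstance(x, str) for x in arr)
```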
>>
>> Previous Discussions
>> --------------------
>>
>> The project last discussed this topic in depth in 2017, when Julian Taylor
>> proposed a fixed-width text data type parameterized by an encoding [2]_.
>> This
>> started a wide-ranging discussion about pain points for working with
>> string data
>> in NumPy and possible ways forward.
>>
>> In the end, the discussion identified two use-cases that the current
>> support for
>> strings does a poor job of handling:
>>
>> * Loading or memory-mapping scientific datasets with unknown encoding,
>> * Working with string data in a manner that allows transparent conversion
>> between NumPy arrays and Python strings, including support for missing
>> strings.
>>
>> As a result of this discussion, improving support for string data was
>> added to
>> the NumPy project roadmap [3]_, with an explicit call-out to add a DType
>> better
>> suited to memory-mapping bytes with any or no encoding, and a
>> variable-width
>> string DType that supports missing data to replace usages of object string
>> arrays.
>>
>> Proposed work
>> -------------
>>
>> This NEP proposes adding ``StringDType``, a DType that stores
>> variable-width
>> heap-allocated strings in NumPy arrays, to replace downstream usages of
>> the
>> ``object`` DType for string data. This work will heavily leverage recent
>> improvements to NumPy's support for user-defined DTypes, so we will
>> also necessarily be working on the data type internals in NumPy. In
>> particular,
>> we propose to:
>>
>> * Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
>>
>> * Work out issues related to adding a DType implemented using the
>> experimental
>> DType API to NumPy itself.
>>
>> * Add support for a user-provided missing data sentinel.
>>
>> * Clean up ``np.char``, moving the ufunc-like functions to a new namespace
>> for functions and types related to string support.
>>
>> * Update the ``npy`` and ``npz`` file formats to allow storage of
>> arbitrary-length sidecar data.
>>
>> The following is out of scope for this work:
>>
>> * Changing DType inference for string data.
>>
>> * Adding a DType for memory-mapping text in unknown encodings or a DType
>> that
>> attempts to fix issues with the ``bytes_`` DType.
>>
>> * Fully agreeing on the semantics of a missing data sentinel or adding a
>> missing data sentinel to NumPy itself.
>>
>> * Implementing fast ufuncs or SIMD optimizations for string operations.
>>
>> While we're explicitly ruling out implementing these items as part of
>> this work,
>> adding a new string DType helps set up future work that does implement
>> some of
>> these items.
>>
>> If implemented, this NEP will make it easier to add a new fixed-width text
>> DType
>> in the future by moving string operations into a long-term supported
>> namespace. We are also proposing a memory layout that should be amenable
>> to
>> writing fast ufuncs and SIMD optimization in some cases, increasing the
>> payoff
>> for writing string operations as SIMD-optimized ufuncs in the future.
>>
>> While we are not proposing adding a missing data sentinel to NumPy, we are
>> proposing adding support for an optional, user-provided missing data
>> sentinel,
>> so this does move NumPy a little closer to officially supporting missing
>> data. We are attempting to avoid resolving the disagreement described in
>> :ref:`NEP 26<NEP26>` and this proposal does not require or preclude
>> adding a
>> missing data sentinel or bitflag-based missing data support in the future.
>>
>> Usage and Impact
>> ----------------
>>
>> The DType is intended as a drop-in replacement for object string arrays.
>> This
>> means that we intend to support as many downstream usages of object string
>> arrays as possible, including all supported NumPy functionality. Pandas
>> is the
>> obvious first user, and substantial work has already occurred to add
>> support in
>> a fork of Pandas. ``scikit-learn`` also uses object string arrays and
>> will be
>> able to migrate to a DType that guarantees the arrays contain only
>> strings. Both h5py [4]_ and PyTables [5]_ will be able to add first-class
>> support for variable-width UTF-8 encoded string datasets in HDF5. String
>> data
>> are heavily used in machine-learning workflows and downstream machine
>> learning
>> libraries will be able to leverage this new DType.
>>
>> Users who wish to load string data into NumPy and leverage NumPy features
>> like
>> advanced indexing will have a natural choice that offers substantial memory
>> savings over fixed-width unicode strings, better validation guarantees, and
>> better overall integration with NumPy than object string arrays. Moving to a
>> first-class string DType also removes the need to acquire the GIL during
>> string
>> operations, unlocking future optimizations that are impossible with object
>> string arrays.
>>
>> Performance
>> ***********
>>
>> Here we briefly describe preliminary performance measurements of the
>> prototype
>> version of ``StringDType`` we have implemented outside of NumPy using the
>> experimental DType API. All benchmarks in this section were performed on
>> a Dell
>> XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv.
>> NumPy,
>> Pandas, and the ``StringDType`` prototype were all compiled with meson
>> release
>> builds.
>>
>> Currently, the ``StringDType`` prototype has comparable performance with
>> object
>> arrays and fixed-width string arrays. One exception is array creation from
>> Python strings, where performance is somewhat slower than with object
>> arrays and comparable to fixed-width unicode arrays::
>>
>> In [1]: from stringdtype import StringDType
>>
>> In [2]: import numpy as np
>>
>> In [3]: data = [str(i) * 10 for i in range(100_000)]
>>
>> In [4]: %timeit arr_object = np.array(data, dtype=object)
>> 3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>
>> In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
>> 12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>
>> In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
>> 11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>
>> In this example, object DTypes are substantially faster because the
>> objects in
>> the ``data`` list can be directly interned in the array, while ``str_`` and
>> ``StringDType`` need to copy the string data, and ``StringDType`` needs to
>> convert the data to UTF-8 and perform additional heap allocations outside
>> the
>> array buffer. In the future, if Python moves to a UTF-8 internal
>> representation
>> for strings, the string loading performance of ``StringDType`` should
>> improve.
>>
>> String operations have similar performance::
>>
>> In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
>> 30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> In [8]: %timeit np.char.capitalize(arr_stringdtype)
>> 38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> In [9]: %timeit np.char.capitalize(arr_strdtype)
>> 46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> The poor performance here is a reflection of the slow iterator-based
>> implementation of operations in ``np.char``. If we were to rewrite these
>> operations as ufuncs, we could unlock substantial performance
>> improvements. Consider the example of the ``add`` ufunc, which we have
>> implemented for the ``StringDType`` prototype::
>>
>> In [10]: %timeit arr_object + arr_object
>> 10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>
>> In [11]: %timeit arr_stringdtype + arr_stringdtype
>> 5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>
>> In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
>> 65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> As described below, we have already updated a fork of Pandas to use a
>> prototype
>> version of ``StringDType``. This demonstrates the performance
>> improvements
>> available when data are already loaded into a NumPy array and are passed
>> to a
>> third-party library. Currently Pandas attempts to coerce all ``str``
>> data to
>> ``object`` DType by default, and has to check and sanitize existing
>> ``object``
>> arrays that are passed in. This requires a copy or pass over the data made
>> unnecessary by first-class support for variable-width strings in both
>> NumPy and
>> Pandas::
>>
>> In [13]: import pandas as pd
>>
>> In [14]: %timeit pd.Series(arr_stringdtype)
>> 20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> In [15]: %timeit pd.Series(arr_object)
>> 1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>
>> We have also implemented a Pandas extension DType that uses ``StringDType``
>> under the hood, which is also substantially faster for creating Pandas
>> data
>> structures than the existing Pandas string DType that uses ``object``
>> arrays::
>>
>> In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
>> 54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
>> 1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>
>> Backward compatibility
>> ----------------------
>>
>> We are not proposing a change to DType inference for Python strings and do
>> not expect any impact on existing usages of NumPy, besides warnings or
>> errors related to new deprecations or expiring deprecations in ``np.char``.
>>
>>
>> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: ralf.gomm...@googlemail.com
>