Re: [Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System

Sebastian Berg Sun, 22 Mar 2020 11:33:08 -0700

Hi,

thanks for the feedback!


On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote:
> Thanks for publicizing this and all the work that has gone into
> getting
> this far.
> 
> I'm extremely supportive of the foundational DType meta-type and
> making
> dtypes classes.  This was the epiphany I had in 2015 that led me to
> experiment with xnd and later mtypes.  I have not had the funding to
> work
> on it much since that time directly.

Right, I realize it is an old idea. If you have any references I am
missing (I am sure there are many), I am happy to add them.

> But, this is the right way to connect the data type system with the
> rest of
> Python typing.  NumPy's current dtypes are currently analogous to
> Python
> 1's user-defined classes.  In Python 1 *all* user-defined classes
> were
> instances of a single Class Type at the C-level, just like currently
> all
> NumPy dtypes are instances of a single Dtype "Type" in Python.
> 
> Shifting Dtypes to be true types (by making them instances of a
> single
> low-level MetaType) is (IMHO) exactly the right approach.   Doing
> this
> first while trying to minimize other changes will help a lot.   I'm
> very
> excited by the work being done in this direction.
> 
> I can appreciate the desire to be cautious on some of the other
> issues
> (like removing numpy array scalars).  I do still think that
> eventually
> removing numpy array scalars in lieu of instances of dtype objects
> will be a
> less complex approach and am not sold generally by the reasons listed
> in
> the NEP (though I can appreciate that it's not something to do as
> part of
> *this* NEP) as getting there might take more effort than desired at
> this
> point.

Well, I do think it is a pretty strong design decision here though. If
instances of DType classes are the actual dtypes (and not themselves
classes), then it seems strange if scalars are also (direct) instances
of the same DType class?

Of course we can and probably will allow `isinstance(scalar, DType)` to
work in either case. I do not see a problem with that, although I do
not feel like making that decision right now.
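
To make that concrete, here is a minimal pure-Python sketch of how such a
check could be made to pass without scalars being real instances. It
assumes a metaclass hook; `DTypeMeta`, `Float64DType` and `scalar_type`
are hypothetical names used only for illustration, and the builtin
`float` stands in for a NumPy scalar type:

```python
class DTypeMeta(type):
    """Hypothetical metaclass: its instances are DType classes."""

    def __instancecheck__(cls, obj):
        # Real dtype instances pass the normal check.
        if super().__instancecheck__(obj):
            return True
        # Optionally also accept the scalar type tied to this DType class.
        scalar_type = cls.__dict__.get("scalar_type")
        return scalar_type is not None and type(obj) is scalar_type


class Float64DType(metaclass=DTypeMeta):
    scalar_type = float  # stand-in for np.float64 (assumption)


dtype = Float64DType()
print(isinstance(dtype, Float64DType))  # the dtype is a real instance
print(isinstance(1.5, Float64DType))    # the scalar passes via the hook
print(type(1.5) is Float64DType)        # but it is not actually an instance
```

The point is only that the `isinstance` question can be decided later,
independently of what scalars really are.
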

If we can agree on still going this direction for now I am happy of
course. Nothing stops us from amending or finding new solutions in the
future after all.

I used to love the idea, but to be honest, I currently do not see:

1. How to approach it. It would have to be within Python itself, or we
would need more shims for Python builtin types? 
2. That it is actually helpful for users.

If we were designing a new programming language around array computing
principles, I do think that would be the approach I would want to
take/consider. But I simply lack the vision of how marrying the idea
with the scalar language Python would work out well...


> What I would *strongly* recommend right now, however, is to make the
> new
> NumPy dtype system a separately-installable module (kept in the NumPy
> GitHub organization).   In that way, people can depend on the NumPy
> type
> system without depending on NumPy itself.  I think this will become
> more
> and more important in the future.  It will help the design as you see
> NumPy
> as one of many *consumers* of the type system instead of the only
> one.  It
> would also help projects like arrow and xnd and others in the future
> that
> might only want to depend on NumPy's type system but otherwise
> implement
> their own computations.
> 

Right, I agree that is the correct long term direction to see the
DTypes as distinct from the NumPy array, and maybe I should add that to
the NEP.
What I am unsure about is the feasibility. If we develop it outside of
NumPy, it is harder to:

1. Use the new system without actually exposing it as public API in
order to incrementally replace the old with a newer machinery.
2. Avoid either exposing subclassing capabilities to NumPy to
add shims for legacy DTypes right from the start, or adding a bunch of
public API to that project which is only meant to be used within NumPy.

I suppose, I am also not sure that having it in NumPy (at least for
now) is actually all that bad? For array-likes it is probably not the
heaviest dependency (and it could be slimmed down into a core).

Since the intention is to dogfood the API as much as possible and to
limit the public API, it should be plausible to rip it out later of
course.
I am sure that will be more overall effort, but I suppose I feel it is
a much more approachable effort.

One thing I would like is for projects such as CuPy to be able to
subclass DTypes at some point to tag on the GPU aware things they need.
But in some sense the basic DTypes seem to require being tied in with
NumPy? They must be associated with the NumPy scalars, and the basic
methods defined for all DTypes (also user DTypes) will probably be
strided-inner-loops on the CPU.

> This might require a little more work to provide an adaptor layer in
> NumPy
> itself to use the new system instead of its current dtypes, but I
> think it
> will also help ensure that the datatype API is cleaner and more
> useful to
> the Python ecosystem as a whole.

While I fully agree with the sentiment, I suppose I am scared that the
little more work will end up being too much :(. We have pretty limited
resources and the most difficult work will not be writing the DType API
itself. It will be wrangling it into NumPy and the associated huge
review effort to get it right.
I think that only by actually wrangling it into NumPy can we get the
API fully right to begin with.
So, I am scared that moving development outside and trying to add the
more global scope at this time will make the NumPy side much more
difficult :(. Maybe not even because it is actually much trickier, but
again because it seems less tangible/approachable.

So, my main point here is that we have to make this large refactor as
approachable as possible, and if that means that at some point someone
has to spend a huge, but hopefully straightforward, effort to rip
DTypes out of NumPy, I think that might be a worthy trade-off.
Unless we can activate significantly larger resources very quickly.

Best,

Sebastian


> 
> Thanks,
> 
> -Travis
> 
> 
> 
> 
> On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg <
> [email protected]>
> wrote:
> 
> > Hi all,
> > 
> > I am pleased to propose NEP 41: First step towards a new Datatype
> > System https://numpy.org/neps/nep-0041-improved-dtype-support.html
> > 
> > This NEP motivates the larger restructure of the datatype machinery
> > in
> > NumPy and defines a few fundamental design aspects. The long term
> > user
> > impact will be allowing easier and more feature-rich user defined
> > datatypes.
> > 
> > As this is a large restructure, the NEP represents only the first
> > steps
> > with some additional information in further NEPs being drafted [1]
> > (this may be helpful to look at depending on the level of detail
> > you are
> > interested in).
> > The NEP itself does not propose to add significant new public API.
> > Instead it proposes to move forward with an incremental internal
> > refactor and lays the foundation for this process.
> > 
> > The main user facing change at this time is that datatypes will
> > become
> > classes (e.g. ``type(np.dtype("float64"))`` will be a float64
> > specific
> > class).
> > For most users, the main impact should be many new datatypes in the
> > long run (see the user impact section). However, for those
> > interested
> > in API design within NumPy or with respect to implementing new
> > datatypes, this and the following NEPs are important decisions in
> > the
> > future roadmap for NumPy.
> > 
> > The current full text is reproduced below, although the above link
> > is
> > probably a better way to read it.
> > 
> > Cheers
> > 
> > Sebastian
> > 
> > 
> > [1] NEP 40 gives some background information about the current
> > systems
> > and issues with it:
> > 
> > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> > and NEP 42 being a first draft of what the new API may look like:
> > 
> > 
> > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> > (links to current rendered versions, check
> > https://github.com/numpy/numpy/pull/15505 and
> > https://github.com/numpy/numpy/pull/15507 for updates)
> > 
> > 
> > ----------------------------------------------------------------------
> > 
> > 
> > =================================================
> > NEP 41 — First step towards a new Datatype System
> > =================================================
> > 
> > :title: Improved Datatype Support
> > :Author: Sebastian Berg
> > :Author: Stéfan van der Walt
> > :Author: Matti Picus
> > :Status: Draft
> > :Type: Standard Track
> > :Created: 2020-02-03
> > 
> > 
> > .. note::
> > 
> >     This NEP is part of a series of NEPs encompassing first
> > information
> >     about the previous dtype implementation and issues with it in
> > NEP 40.
> >     NEP 41 (this document) then provides an overview and generic
> > design
> >     choices for the refactor.
> >     Further NEPs 42 and 43 go into the technical details of the
> > datatype
> >     and universal function related internal and external API
> > changes.
> >     In some cases it may be necessary to consult the other NEPs for
> > a full
> >     picture of the desired changes and why these changes are
> > necessary.
> > 
> > 
> > Abstract
> > --------
> > 
> > `Datatypes <data-type-objects-dtype>` in NumPy describe how to
> > interpret
> > each
> > element in arrays. NumPy provides ``int``, ``float``, and
> > ``complex``
> > numerical
> > types, as well as string, datetime, and structured datatype
> > capabilities.
> > The growing Python community, however, has need for more diverse
> > datatypes.
> > Examples are datatypes with unit information attached (such as
> > meters) or
> > categorical datatypes (fixed set of possible values).
> > However, the current NumPy datatype API is too limited to allow the
> > creation
> > of these.
> > 
> > This NEP is the first step to enable such growth; it will lead to
> > a simpler development path for new datatypes.
> > In the long run the new datatype system will also support the
> > creation
> > of datatypes directly from Python rather than C.
> > Refactoring the datatype API will improve maintainability and
> > facilitate
> > development of both user-defined external datatypes,
> > as well as new features for existing datatypes internal to NumPy.
> > 
> > 
> > Motivation and Scope
> > --------------------
> > 
> > .. seealso::
> > 
> >     The user impact section includes examples of what kind of new
> > datatypes
> >     will be enabled by the proposed changes in the long run.
> >     It may thus help to read these sections out of order.
> > 
> > Motivation
> > ^^^^^^^^^^
> > 
> > One of the main issues with the current API is the definition of
> > typical
> > functions such as addition and multiplication for parametric
> > datatypes
> > (see also NEP 40) which require additional steps to determine the
> > output
> > type.
> > For example when adding two strings of length 4, the result is a
> > string
> > of length 8, which is different from the input.
> > Similarly, a datatype which embeds a physical unit must calculate
> > the new
> > unit
> > information: dividing a distance by a time results in a speed.
> > A related difficulty is that the :ref:`current casting rules
> > <_ufuncs.casting>`
> > -- the conversion between different datatypes --
> > cannot describe casting for such parametric datatypes implemented
> > outside
> > of NumPy.
> > 
> > This additional functionality for supporting parametric datatypes
> > introduces
> > increased complexity within NumPy itself,
> > and furthermore is not available to external user-defined
> > datatypes.
> > In general the concerns of different datatypes are not
> > well-encapsulated.
> > This burden is exacerbated by the exposure of internal C
> > structures,
> > limiting the addition of new fields
> > (for example to support new sorting methods [new_sort]_).
> > 
> > Currently there are many factors which limit the creation of new
> > user-defined
> > datatypes:
> > 
> > * Creating casting rules for parametric user-defined dtypes is
> > either
> > impossible
> >   or so complex that it has never been attempted.
> > * Type promotion, e.g. the operation deciding that adding float and
> > integer
> >   values should return a float value, is very valuable for numeric
> > datatypes
> >   but is limited in scope for user-defined and especially
> > parametric
> > datatypes.
> > * Much of the logic (e.g. promotion) is written in single functions
> >   instead of being split as methods on the datatype itself.
> > * In the current design datatypes cannot have methods that do not
> > generalize
> >   to other datatypes. For example a unit datatype cannot have a
> > ``.to_si()`` method to
> >   easily find the datatype which would represent the same values in
> > SI
> > units.
> > 
> > The large need to solve these issues has driven the scientific
> > community
> > to create work-arounds in multiple projects implementing physical
> > units as
> > an
> > array-like class instead of a datatype, which would generalize
> > better
> > across
> > multiple array-likes (Dask, pandas, etc.).
> > Already, Pandas has made a push into the same direction with its
> > extension arrays [pandas_extension_arrays]_ and undoubtedly
> > the community would be best served if such new features could be
> > common
> > between NumPy, Pandas, and other projects.
> > 
> > Scope
> > ^^^^^
> > 
> > The proposed refactoring of the datatype system is a large
> > undertaking and
> > thus is proposed to be split into various phases, roughly:
> > 
> > * Phase I: Restructure and extend the datatype infrastructure (This
> > NEP 41)
> > * Phase II: Incrementally define or rework API (Detailed largely in
> > NEPs
> > 42/43)
> > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> > capabilities.
> > 
> > For a more detailed accounting of the various phases, see
> > "Plan to Approach the Full Refactor" in the Implementation section
> > below.
> > This NEP proposes to move ahead with the necessary creation of new
> > dtype
> > subclasses (Phase I),
> > and start working on implementing current functionality.
> > Within the context of this NEP all development will be fully
> > private API or
> > use preliminary underscored names which must be changed in the
> > future.
> > Most of the internal and public API choices are part of a second
> > Phase
> > and will be discussed in more detail in the following NEPs 42 and
> > 43.
> > The initial implementation of this NEP will have little or no
> > effect on
> > users,
> > but provides the necessary ground work for incrementally addressing
> > the
> > full rework.
> > 
> > The implementation of this NEP and the following, implied large
> > rework of
> > how
> > datatypes are defined in NumPy is expected to create small
> > incompatibilities
> > (see backward compatibility section).
> > However, a transition requiring large code adaption is not
> > anticipated and
> > not
> > within scope.
> > 
> > Specifically, this NEP makes the following design choices which are
> > discussed
> > in more details in the detailed description section:
> > 
> > 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> > with
> > most of the
> >    datatype-specific logic being implemented
> >    as special methods on the class. In the C-API, these correspond
> > to
> > specific
> >    slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f,
> > np.dtype)`` will remain true,
> >    but ``type(f)`` will be a subclass of ``np.dtype`` rather than
> > just
> > ``np.dtype`` itself.
> >    The ``PyArray_ArrFuncs`` which are currently stored as a pointer
> > on the
> > instance (as ``PyArray_Descr->f``),
> >    should instead be stored on the class as typically done in
> > Python.
> >    In the future these may correspond to python side dunder
> > methods.
> >    Storage information such as itemsize and byteorder can differ
> > between
> >    different dtype instances (e.g. "S3" vs. "S8") and will remain
> > part of
> > the instance.
> >    This means that in the long run the current lowlevel access to
> > dtype
> > methods
> >    will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
> > 
> > 2. The current NumPy scalars will *not* change, they will not be
> > instances
> > of
> >    datatypes. This will also be true for new datatypes, scalars
> > will not be
> >    instances of a dtype (although ``isinstance(scalar, dtype)`` may
> > be made
> >    to return ``True`` when appropriate).
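
As a rough illustration of design choice 1, the split between class-level
logic and instance-level storage could look like the following in pure
Python. This is only a sketch: `DTypeMeta`, `DType` and `BytesDType` are
stand-ins defined here, not the C implementation, and the `nonzero`
logic is deliberately simplified:

```python
class DTypeMeta(type):
    """Hypothetical stand-in for the C-level metaclass of all DType classes."""


class DType(metaclass=DTypeMeta):
    """Stand-in for np.dtype; subclasses carry the type-specific logic."""

    def __init__(self, itemsize):
        self.itemsize = itemsize  # storage details stay on the instance


class BytesDType(DType):
    # Logic shared by every bytes dtype lives on the class (compare the
    # PyArray_ArrFuncs slots), not on each instance. Simplified on purpose.
    @staticmethod
    def nonzero(raw: bytes) -> bool:
        return any(raw)


s3, s8 = BytesDType(3), BytesDType(8)  # analogous to "S3" and "S8"
print(isinstance(s3, DType))           # the isinstance check keeps working
print(type(s3) is DType)               # type(s3) is now a subclass instead
print(type(s3) is type(s8))            # one class, differing instances
print(isinstance(type(s3), DTypeMeta)) # the class is an instance of the metaclass
```
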
> > 
> > Detailed technical decisions to follow in NEP 42.
> > 
> > Further, the public API will be designed in a way that is
> > extensible in
> > the future:
> > 
> > 3. All new C-API functions provided to the user will hide
> > implementation
> > details
> >    as much as possible. The public API should be an identical, but
> > limited,
> >    version of the C-API used for the internal NumPy datatypes.
> > 
> > The changes to the datatype system in Phase II must include a large
> > refactor of the
> > UFunc machinery, which will be further defined in NEP 43:
> > 
> > 4. To enable all of the desired functionality for new user-defined
> > datatypes,
> >    the UFunc machinery will be changed to replace the current
> > dispatching
> >    and type resolution system.
> >    The old system should be *mostly* supported as a legacy version
> > for
> > some time.
> > 
> > Additionally, as a general design principle, the addition of new
> > user-defined
> > datatypes will *not* change the behaviour of programs.
> > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a``
> > or ``b``
> > know
> > that ``c`` exists.
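
One way to get this property by construction is to ask only the two
operands, in the style of Python's binary operator dispatch. The sketch
below assumes a hypothetical `__common_dtype__` hook and toy `Float64`,
`Unit` and `Rational` classes; none of these names are existing NumPy
API:

```python
class DType:
    """Hypothetical base class; ``__common_dtype__`` is an illustrative name."""

    def __common_dtype__(self, other):
        # By default a DType only knows how to promote with itself.
        return self if type(other) is type(self) else NotImplemented


class Float64(DType):
    pass


class Unit(DType):
    def __common_dtype__(self, other):
        # Unit was written with Float64 in mind, so it may answer for the pair.
        return self if isinstance(other, (Unit, Float64)) else NotImplemented


class Rational(DType):
    """A DType that neither Float64 nor Unit has ever heard of."""


def common_dtype(a, b):
    # Only the two operands are asked; no global registry is consulted, so
    # importing an unrelated DType cannot change the result for this pair.
    for first, second in ((a, b), (b, a)):
        result = first.__common_dtype__(second)
        if result is not NotImplemented:
            return result
    raise TypeError(
        f"no common dtype for {type(a).__name__} and {type(b).__name__}"
    )


print(type(common_dtype(Float64(), Unit())).__name__)
```

Here `common_dtype(Float64(), Rational())` raises `TypeError` instead of
silently picking a third type that neither operand knows about.
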
> > 
> > 
> > User Impact
> > -----------
> > 
> > The current ecosystem has very few user-defined datatypes using
> > NumPy, the
> > two most prominent being: ``rational`` and ``quaternion``.
> > These represent fairly simple datatypes which are not strongly
> > impacted
> > by the current limitations.
> > However, we have identified a need for datatypes such as:
> > 
> > * bfloat16, used in deep learning
> > * categorical types
> > * physical units (such as meters)
> > * datatypes for tracing/automatic differentiation
> > * high, fixed precision math
> > * specialized integer types such as int2, int24
> > * new, better datetime representations
> > * extending e.g. integer dtypes to have a sentinel NA value
> > * geometrical objects [pygeos]_
> > 
> > Some of these are partially solved; for example unit capability is
> > provided
> > in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray`
> > subclasses.
> > Most of these datatypes, however, simply cannot be reasonably
> > defined
> > right now.
> > An advantage of having such datatypes in NumPy is that they should
> > integrate
> > seamlessly with other array or array-like packages such as Pandas,
> > ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
> > 
> > The long term user impact of implementing this NEP will be to allow
> > both
> > the growth of the whole ecosystem by having such new datatypes, as
> > well as
> > consolidating implementation of such datatypes within NumPy to
> > achieve
> > better interoperability.
> > 
> > 
> > Examples
> > ^^^^^^^^
> > 
> > The following examples represent future user-defined datatypes we
> > wish to
> > enable.
> > These datatypes are not part of the NEP and choices (e.g. choice of
> > casting
> > rules)
> > are possibilities we wish to enable and do not represent
> > recommendations.
> > 
> > Simple Numerical Types
> > """"""""""""""""""""""
> > 
> > Mainly used where memory is a consideration, lower-precision
> > numeric types
> > such as `bfloat16
> > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> > are common in other computational frameworks.
> > For these types the definitions of things such as
> > ``np.common_type`` and
> > ``np.can_cast`` are some of the most important interfaces. Once
> > they
> > support ``np.common_type``, it is (for the most part) possible to
> > find
> > the correct ufunc loop to call, since most ufuncs -- such as add --
> > effectively
> > only require ``np.result_type``::
> > 
> >     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
> > 
> > and `~numpy.result_type` is largely identical to
> > `~numpy.common_type`.
> > 
> > 
> > Fixed, high precision math
> > """"""""""""""""""""""""""
> > 
> > Allowing arbitrary precision or higher precision math is important
> > in
> > simulations. For instance ``mpmath`` defines a precision::
> > 
> >     >>> import mpmath as mp
> >     >>> print(mp.dps)  # the current (default) precision
> >     15
> > 
> > NumPy should be able to construct a native, memory-efficient array
> > from
> > a list of ``mpmath.mpf`` floating point objects::
> > 
> >     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a
> > list)
> >     >>> print(arr_15_dps)  # Must find the correct precision from
> > the
> > objects:
> >     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
> > 
> > We should also be able to specify the desired precision when
> > creating the datatype for an array. Here, we use
> > ``np.dtype[mp.mpf]``
> > to find the DType class (the notation is not part of this NEP),
> > which is then instantiated with the desired parameter.
> > This could also be written as ``MpfDType`` class::
> > 
> >     >>> arr_100_dps = np.array([1, 2, 3],
> > dtype=np.dtype[mp.mpf](dps=100))
> >     >>> print(arr_15_dps + arr_100_dps)
> >     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
> > 
> > The ``mpf`` datatype can decide that the result of the operation
> > should be
> > the
> > higher precision one of the two, so uses a precision of 100.
> > Furthermore, we should be able to define casting, for example as
> > in::
> > 
> >     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype,
> > casting="safe")
> >     True
> >     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype,
> > casting="safe")
> >     False  # loses precision
> >     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype,
> > casting="same_kind")
> >     True
> > 
> > Casting from float is probably always at least a ``same_kind``
> > cast, but
> > in general, it is not safe::
> > 
> >     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4),
> > casting="safe")
> >     False
> > 
> > since a float64 has a higher precision than the ``mpf`` datatype
> > with
> > ``dps=4``.
> > 
> > Alternatively, we can say that::
> > 
> >     >>> np.common_type(np.dtype[mp.mpf](dps=5),
> > np.dtype[mp.mpf](dps=10))
> >     np.dtype[mp.mpf](dps=10)
> > 
> > And possibly even::
> > 
> >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> >     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I
> > believe)
> > 
> > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> > safely.
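
The casting and promotion rules in this example boil down to comparing
the ``dps`` parameter. A minimal sketch, assuming a toy `MpfDType` class
and modelling float64 as 16 decimal digits as the text above does (both
are assumptions for illustration, not mpmath or NumPy API):

```python
from dataclasses import dataclass

FLOAT64_DPS = 16  # assumed decimal precision of float64, as in the text above


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical parametric DType; ``dps`` is the decimal precision."""

    dps: int = 15


def can_cast_safe(from_dtype: MpfDType, to_dtype: MpfDType) -> bool:
    # A cast is safe only if no precision is lost.
    return from_dtype.dps <= to_dtype.dps


def common_type(a: MpfDType, b: MpfDType) -> MpfDType:
    # Promote to the higher precision of the two.
    return MpfDType(dps=max(a.dps, b.dps))


print(can_cast_safe(MpfDType(15), MpfDType(100)))
print(can_cast_safe(MpfDType(100), MpfDType(15)))
print(common_type(MpfDType(5), MpfDType(FLOAT64_DPS)))
```
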
> > 
> > 
> > Categoricals
> > """"""""""""
> > 
> > Categoricals are interesting in that they can have fixed,
> > predefined
> > values,
> > or can be dynamic with the ability to modify categories when
> > necessary.
> > Fixed categories (defined ahead of time) are the most
> > straightforward
> > categorical definition.
> > Categoricals are *hard*, since there are many strategies to
> > implement them,
> > suggesting NumPy should only provide the scaffolding for user-
> > defined
> > categorical types. For instance::
> > 
> >     >>> cat = Categorical(["eggs", "spam", "toast"])
> >     >>> breakfast = array(["eggs", "spam", "eggs", "toast"],
> > dtype=cat)
> > 
> > could store the array very efficiently, since it knows that there
> > are only
> > 3
> > categories.
> > Since a categorical in this sense knows almost nothing about the
> > data
> > stored
> > in it, few operations make sense, although equality does::
> > 
> >     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"],
> > dtype=cat)
> >     >>> breakfast == breakfast2
> >     array([True, False, True, False])
> > 
> > The categorical datatype could work like a dictionary: no two
> > item names can be equal (checked on dtype creation), so that the
> > equality
> > operation above can be performed very efficiently.
> > If the values define an order, the category labels (internally
> > integers)
> > could
> > be ordered the same way to allow efficient sorting and comparison.
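
The dictionary-like storage described here can be sketched in a few
lines of pure Python. The `Categorical` class below is a toy written for
this illustration (plain lists stand in for arrays); it is not a
proposed API:

```python
class Categorical:
    """Hypothetical fixed-category dtype storing small integer codes."""

    def __init__(self, categories):
        # Uniqueness is checked once, on dtype creation.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        self._codes = {name: i for i, name in enumerate(self.categories)}

    def encode(self, values):
        # Each value is stored as its integer code.
        return [self._codes[v] for v in values]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])

# Equality only needs to compare the integer codes.
equal = [a == b for a, b in zip(breakfast, breakfast2)]
print(breakfast, equal)
```
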
> > 
> > Whether or not casting is defined from one categorical with fewer
> > values to one with strictly more values defined is something that
> > the Categorical datatype would need to decide. Both options should
> > be available.
> > 
> > 
> > Unit on the Datatype
> > """"""""""""""""""""
> > 
> > There are different ways to define Units, depending on how the
> > internal
> > machinery would be organized, one way is to have a single Unit
> > datatype
> > for every existing numerical type.
> > This will be written as ``Unit[float64]``; the unit itself is part
> > of the
> > DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters
> > attached::
> > 
> >     >>> from astropy import units
> >     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  #
> > meters
> >     >>> print(meters)
> >     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> > 
> > Note that units are a bit tricky. It is debatable, whether::
> > 
> >     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> > 
> > should be valid syntax (coercing the float scalars without a unit
> > to
> > meters).
> > Once the array is created, math will work without any issue::
> > 
> >     >>> meters / (2 * units.s)
> >     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
> > 
> > Casting is not valid from one unit to the other, but can be valid
> > between
> > different scales of the same dimensionality (although this may be
> > "unsafe")::
> > 
> >     >>> meters.astype(Unit[float64]("s"))
> >     TypeError: Cannot cast meters to seconds.
> >     >>> meters.astype(Unit[float64]("km"))
> >     >>> # Convert to centimeter-gram-second (cgs) units:
> >     >>> meters.astype(meters.dtype.to_cgs())
> > 
> > The above notation is somewhat clumsy. Functions
> > could be used instead to convert between units.
> > There may be ways to make these more convenient, but those must be
> > left
> > for future discussions::
> > 
> >     >>> units.convert(meters, "km")
> >     >>> units.to_cgs(meters)
> > 
> > There are some open questions. For example, whether additional
> > methods
> > on the array object could exist to simplify some of the notions,
> > and how
> > these
> > would percolate from the datatype to the ``ndarray``.
> > 
> > The interaction with other scalars would likely be defined
> > through::
> > 
> >     >>> np.common_type(np.float64, Unit)
> >     Unit[np.float64](dimensionless)
> > 
> > Ufunc output datatype determination can be more involved than for
> > simple
> > numerical dtypes since there is no "universal" output type::
> > 
> >     >>> np.multiply(meters, seconds).dtype !=
> > np.result_type(meters,
> > seconds)
> > 
> > In fact ``np.result_type(meters, seconds)`` must error without
> > context
> > of the operation being done.
> > This example highlights how the specific ufunc loop
> > (loop with known, specific DTypes as inputs), has to be able to
> > make
> > certain decisions before the actual calculation can start.
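
A small sketch of why the output dtype has to be resolved per operation:
with a toy `Unit` class that tracks exponents of meters and seconds, each
operation computes its own result unit, while a context-free
`result_type` can only succeed when the units already agree. All names
here are hypothetical and written for this illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical unit dtype: exponents of meters (m) and seconds (s)."""

    m: int = 0
    s: int = 0


def resolve_multiply(a: Unit, b: Unit) -> Unit:
    # Multiplying quantities adds the unit exponents.
    return Unit(a.m + b.m, a.s + b.s)


def resolve_divide(a: Unit, b: Unit) -> Unit:
    # Dividing quantities subtracts the unit exponents.
    return Unit(a.m - b.m, a.s - b.s)


def result_type(a: Unit, b: Unit) -> Unit:
    # Without knowing the operation there is no meaningful common unit.
    if a != b:
        raise TypeError(f"no common unit for {a} and {b}")
    return a


meters, seconds = Unit(m=1), Unit(s=1)
print(resolve_divide(meters, seconds))    # a speed: m=1, s=-1
print(resolve_multiply(meters, seconds))  # m=1, s=1
```
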
> > 
> > 
> > 
> > Implementation
> > --------------
> > 
> > Plan to Approach the Full Refactor
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > To address these issues in NumPy and enable new datatypes,
> > multiple development stages are required:
> > 
> > * Phase I: Restructure and extend the datatype infrastructure (This
> > NEP)
> > 
> >   * Organize Datatypes like normal Python classes [`PR 15508`]_
> > 
> > * Phase II: Incrementally define or rework API
> > 
> >   * Create a new and easily extensible API for defining new
> > datatypes
> >     and related functionality. (NEP 42)
> > 
> >   * Incrementally define all necessary functionality through the
> > new API
> > (NEP 42):
> > 
> >     * Defining operations such as ``np.common_type``.
> >     * Allowing to define casting between datatypes.
> >     * Add functionality necessary to create a numpy array from
> > Python
> > scalars
> >       (i.e. ``np.array(...)``).
> >     * …
> > 
> >   * Restructure how universal functions work (NEP 43), in order to:
> > 
> >     * make it possible to allow a `~numpy.ufunc` such as ``np.add``
> > to be
> >       extended by user-defined datatypes such as Units.
> > 
> >     * allow efficient lookup for the correct implementation for
> > user-defined
> >       datatypes.
> > 
> >     * enable reuse of existing code. Units should be able to use
> > the
> >       normal math loops and add additional logic to determine
> > output type.
> > 
> > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> > capabilities:
> > 
> >   * Cleanup of legacy behaviour where it is considered buggy or
> > undesirable.
> >   * Provide a path to define new datatypes from Python.
> >   * Assist the community in creating types such as Units or
> > Categoricals
> >   * Allow strings to be used in functions such as ``np.equal`` or
> > ``np.add``.
> >   * Remove legacy code paths within NumPy to improve long term
> > maintainability
> > 
> > This document serves as a basis for phase I and provides the vision
> > and
> > motivation for the full project.
> > Phase I does not introduce any new user-facing features,
> > but is concerned with the necessary conceptual cleanup of the
> > current
> > datatype system.
> > It provides a more "pythonic" datatype Python type object, with a
> > clear
> > class hierarchy.
> > 
> > The second phase is the incremental creation of all APIs necessary
> > to
> > define
> > fully featured datatypes and reorganization of the NumPy datatype
> > system.
> > This phase will thus be primarily concerned with defining an,
> > initially preliminary, stable public API.
> > 
> > Some of the benefits of a large refactor may only become evident
> > after the
> > full
> > deprecation of the current legacy implementation (i.e. larger code
> > removals).
> > However, these steps are necessary for improvements to many parts
> > of the
> > core NumPy API, and are expected to make the implementation
> > generally
> > easier to understand.
> > 
> > The following figure illustrates the proposed design at a high
> > level,
> > and roughly delineates the components of the overall design.
> > Note that this NEP only regards Phase I (shaded area),
> > the rest encompasses Phase II and the design choices are up for
> > discussion,
> > however, it highlights that the DType datatype class is the
> > central,
> > necessary
> > concept:
> > 
> > .. image:: _static/nep-0041-mindmap.svg
> > 
> > 
> > First steps directly related to this NEP
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > The required changes necessary to NumPy are large and touch many
> > areas
> > of the code base
> > but many of these changes can be addressed incrementally.
> > 
> > To enable an incremental approach we will start by creating a C
> > defined
> > ``PyArray_DTypeMeta`` class with its instances being the ``DType``
> > classes,
> > subclasses of ``np.dtype``.
> > This is necessary to add the ability of storing custom slots on the
> > DType
> > in C.
> > This ``DTypeMeta`` will be implemented first to then enable
> > incremental
> > restructuring of current code.
> > 
> > The addition of ``DType`` will then enable addressing other changes
> > incrementally, some of which may begin before settling the full
> > internal
> > API:
> > 
> > 1. New machinery for array coercion, with the goal of enabling user
> > DTypes
> >    with appropriate class methods.
> > 2. The replacement or wrapping of the current casting machinery.
> > 3. Incremental redefinition of the current ``PyArray_ArrFuncs``
> > slots into
> >    DType method slots.
> > 
> > At this point, no or only very limited new public API will be added
> > and
> > the internal API is considered to be in flux.
> > Any new public API may be set up to give warnings and will have
> > leading underscores to indicate that it is not finalized and can be
> > changed without warning.
> > 
> > 
> > Backward compatibility
> > ----------------------
> > 
> > While the actual backward compatibility impact of implementing
> > Phase I and II is not yet fully clear, we anticipate and accept the
> > following changes:
> > 
> > * **Python API**:
> > 
> >   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
> > while
> > right
> >     now ``type(np.dtype("f8")) is np.dtype``.
> >     Code should use ``isinstance`` checks, and in very rare cases
> > may have
> > to
> >     be adapted to use it.
> > 
> > * **C-API**:
> > 
> >     * In old versions of NumPy ``PyArray_DescrCheck`` is a macro
> > which uses
> >       ``type(dtype) is np.dtype``. When compiling against an old
> > NumPy
> > version,
> >       the macro may have to be replaced with the corresponding
> >       ``PyObject_IsInstance`` call. (If this is a problem, we could
> > backport
> >       fixing the macro)
> > 
> >    * The UFunc machinery changes will break *limited* parts of the
> > current
> >      implementation. Replacing e.g. the default ``TypeResolver`` is
> > expected
> >      to remain supported for a time, although optimized masked
> > inner loop
> > iteration
> >      (which is not even used *within* NumPy) will no longer be
> > supported.
> > 
> >    * All functions currently defined on the dtypes, such as
> >      ``PyArray_Descr->f->nonzero``, will be defined and accessed
> > differently.
> >      This means that in the long run low-level access code will
> >      have to be changed to use the new API. Such changes are
> >      expected to be necessary in very few projects.
> > 
> > * **dtype implementors (C-API)**:
> > 
> >   * The array which is currently provided to some functions (such
> > as cast
> > functions),
> >     will no longer be provided.
> >     For example ``PyArray_Descr->f->nonzero`` or
> > ``PyArray_Descr->f->copyswapn``,
> >     may instead receive a dummy array object with only some fields
> >     (mainly the dtype) being valid.
> >     At least in some code paths, a similar mechanism is already
> > used.
> > 
> >   * The ``scalarkind`` slot and registration of scalar casting will
> > be
> >      removed/ignored without replacement.
> >      It currently allows partial value-based casting.
> >      The ``PyArray_ScalarKind`` function will continue to work for
> >      builtin types, but will not be used internally and will be
> >      deprecated.
> > 
> >    * Currently user dtypes are defined as instances of
> > ``np.dtype``.
> >      The creation works by the user providing a prototype instance.
> >      NumPy will need to modify at least the type during
> > registration.
> >      This has no effect for either ``rational`` or ``quaternion``
> > and
> > mutation
> >      of the structure seems unlikely after registration.
> > 
> > Since there is a fairly large API surface concerning datatypes,
> > further changes, or the limitation of certain functions to currently
> > existing datatypes, are likely to occur.
> > For example functions which use the type number as input
> > should be replaced with functions taking DType classes instead.
> > Although public, large parts of this C-API seem to be used rarely,
> > possibly never, by downstream projects.
> > 
> > 
> > 
> > Detailed Description
> > --------------------
> > 
> > This section details the design decisions covered by this NEP.
> > The subsections correspond to the list of design choices presented
> > in the Scope section.
> > 
> > Datatypes as Python Classes (1)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > The current NumPy datatypes are not full-scale Python classes.
> > They are instead (prototype) instances of a single ``np.dtype``
> > class.
> > Changing this means that any special handling, e.g. for
> > ``datetime``
> > can be moved to the Datetime DType class instead, away from
> > monolithic
> > general
> > code (e.g. current ``PyArray_AdjustFlexibleDType``).
> > 
> > The main consequence of this change with respect to the API is that
> > special methods move from the dtype instances to methods on the new
> > DType
> > class.
> > This is the typical design pattern used in Python.
> > Organizing these methods and information in a more Pythonic way
> > provides a
> > solid foundation for refining and extending the API in the future.
> > The current API cannot be extended due to how it is exposed
> > publicly.
> > This means for example that the methods currently stored in
> > ``PyArray_ArrFuncs``
> > on each datatype (see NEP 40) will be defined differently in the
> > future and
> > deprecated in the long run.
> > 
> > The most prominent visible side effect of this will be that
> > ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> > Instead it will be a subclass of ``np.dtype`` meaning that
> > ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> > This will also add the ability to use
> > ``isinstance(dtype, np.dtype[float64])``
> > thus removing the need to use ``dtype.kind``, ``dtype.char``, or
> > ``dtype.type``
> > to do this check.
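
As an illustration of why ``isinstance`` is the robust check, here is a small sketch with toy classes. It does not call NumPy; ``dtype``, ``Float64DType`` and ``Int64DType`` are hypothetical stand-ins for ``np.dtype`` and two DType classes.

```python
class dtype:
    """Stand-in for ``np.dtype`` (illustrative only)."""


class Float64DType(dtype):
    """Hypothetical DType class for float64."""


class Int64DType(dtype):
    """Hypothetical DType class for int64."""


def is_dtype_fragile(obj):
    # Exact type comparison: stops working once dtypes are instances
    # of subclasses of the base class.
    return type(obj) is dtype


def is_dtype_robust(obj):
    # isinstance keeps working for the base class and any subclass.
    return isinstance(obj, dtype)


def is_float64(obj):
    # A class-based check that could replace comparisons of
    # dtype.kind, dtype.char, or dtype.type.
    return isinstance(obj, Float64DType)


dt = Float64DType()
print(is_dtype_fragile(dt), is_dtype_robust(dt), is_float64(dt))
```

Code that already uses ``isinstance`` needs no change; only exact ``type(...) is ...`` comparisons are affected.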
> > 
> > With the design decision of DTypes as full-scale Python classes,
> > the question of subclassing arises.
> > Inheritance, however, appears problematic and a complexity best
> > avoided
> > (at least initially) for container datatypes.
> > Further, subclasses may be more interesting for interoperability
> > for
> > example with GPU backends (CuPy) storing additional methods related
> > to the
> > GPU rather than as a mechanism to define new datatypes.
> > A class hierarchy does provide value; this may be achieved by
> > allowing the creation of *abstract* datatypes.
> > An example for an abstract datatype would be the datatype
> > equivalent of
> > ``np.floating``, representing any floating point number.
> > These can serve the same purpose as Python's abstract base classes.
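
A rough sketch of what an abstract datatype could look like, using Python's ``abc`` module. The class names and the ``itemsize`` attribute are assumptions chosen for illustration and are not the proposed NumPy API.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Toy base class for all datatypes (not NumPy's)."""

    @property
    @abstractmethod
    def itemsize(self):
        """Number of bytes per element; concrete DTypes must define it."""


class Floating(DType):
    """Abstract DType: any floating point number. Not instantiable."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


def can_instantiate(cls):
    """Return True if ``cls()`` succeeds, False if it is abstract."""
    try:
        cls()
    except TypeError:
        return False
    return True


print(can_instantiate(Floating))             # abstract
print(can_instantiate(Float64))              # concrete
print(isinstance(Float32(), Floating))       # hierarchy check
```

The abstract class is useful for hierarchy checks even though no array could ever have it as its dtype.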
> > 
> > 
> > Scalars should not be instances of the datatypes (2)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > For simple datatypes such as ``float64`` (see also below), it seems
> > tempting that the instance of a ``np.dtype("float64")`` can be the
> > scalar.
> > This idea may be even more appealing due to the fact that scalars,
> > rather than datatypes, currently define a useful type hierarchy.
> > 
> > However, we have specifically decided against this for a number of
> > reasons.
> > First, the new datatypes described herein would be instances of
> > DType
> > classes.
> > Making these instances themselves classes, while possible, adds
> > additional
> > complexity that users need to understand.
> > It would also mean that scalars must have storage information (such
> > as
> > byteorder)
> > which is generally unnecessary and currently is not used.
> > Second, while the simple NumPy scalars such as ``float64`` may be
> > such
> > instances,
> > it should be possible to create datatypes for Python objects
> > without
> > enforcing
> > NumPy as a dependency.
> > However, Python objects that do not depend on NumPy cannot be
> > instances of
> > a NumPy DType.
> > Third, there is a mismatch between the methods and attributes which
> > are
> > useful
> > for scalars and datatypes. For instance ``to_float()`` makes sense
> > for a
> > scalar
> > but not for a datatype and ``newbyteorder`` is not useful on a
> > scalar (or
> > has
> > a different meaning).
> > 
> > Overall, it seems that rather than reducing the complexity, i.e. by
> > merging the two distinct type hierarchies, making scalars instances
> > of DTypes would increase the complexity of both the design and
> > implementation.
> > 
> > A possible future path may be to instead simplify the current NumPy
> > scalars to
> > be much simpler objects which largely derive their behaviour from
> > the
> > datatypes.
> > 
> > C-API for creating new Datatypes (3)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > The current C-API with which users can create new datatypes
> > is limited in scope, and requires use of "private" structures. This
> > means
> > the API is not extensible: no new members can be added to the
> > structure
> > without losing binary compatibility.
> > This has already limited the inclusion of new sorting methods into
> > NumPy [new_sort]_.
> > 
> > The new version shall thus replace the current ``PyArray_ArrFuncs``
> > structure used
> > to define new datatypes.
> > Datatypes that currently exist and are defined using these slots
> > will be
> > supported during a deprecation period.
> > 
> > The most likely solution to hide the implementation from the user
> > and thus make it extensible in the future is to model the API after
> > Python's stable API [PEP-384]_:
> > 
> > .. code-block:: C
> > 
> >     static struct PyArrayMethodDef slots[] = {
> >         {NPY_dt_method, method_implementation},
> >         ...,
> >         {0, NULL}  /* sentinel terminating the slot list */
> >     };
> > 
> >     typedef struct {
> >       PyTypeObject *typeobj;  /* type of python scalar */
> >       ...;
> >       PyType_Slot *slots;
> >     } PyArrayDTypeMeta_Spec;
> > 
> >     PyObject* PyArray_InitDTypeMetaFromSpec(
> >             PyArray_DTypeMeta *user_dtype,
> >             PyArrayDTypeMeta_Spec *dtype_spec);
> > 
> > The C-side slots should be designed to mirror Python side methods
> > such as ``dtype.__dtype_method__``, although the exposure to Python
> > is
> > a later step in the implementation to reduce the complexity of the
> > initial
> > implementation.
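
The slot-table idea can be mimicked in plain Python to show why it stays extensible. Everything here is hypothetical: the slot IDs, ``DTypeSpec``, and ``init_dtype_from_spec`` are invented for the sketch and do not correspond to any existing NumPy function.

```python
# Hypothetical slot identifiers; real IDs would be defined by NumPy.
SLOT_GETITEM = 1
SLOT_SETITEM = 2

# Maps slot IDs to method names. New IDs can be appended later
# without changing how existing specs are written.
KNOWN_SLOTS = {SLOT_GETITEM: "getitem", SLOT_SETITEM: "setitem"}


class DTypeSpec:
    """Toy counterpart of a spec struct: scalar type plus slot list."""

    def __init__(self, scalar_type, slots):
        self.scalar_type = scalar_type
        self.slots = slots  # list of (slot_id, function) pairs


def init_dtype_from_spec(name, spec):
    """Build a class from a spec; the caller never sees its layout."""
    methods = {"scalar_type": spec.scalar_type}
    for slot_id, func in spec.slots:
        if slot_id not in KNOWN_SLOTS:
            raise ValueError(f"unknown slot id {slot_id}")
        methods[KNOWN_SLOTS[slot_id]] = staticmethod(func)
    return type(name, (), methods)


spec = DTypeSpec(float, [(SLOT_GETITEM, lambda raw: float(raw))])
MyDType = init_dtype_from_spec("MyDType", spec)
print(MyDType.getitem("1.5"))
```

Because the user only passes (ID, function) pairs, the internal structure can grow new members without breaking binary compatibility for existing users, which is the property the current ``PyArray_ArrFuncs`` struct lacks.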
> > 
> > 
> > C-API Changes to the UFunc Machinery (4)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > Proposed changes to the UFunc machinery will be part of NEP 43.
> > However, the following changes will be necessary (see NEP 40 for a
> > detailed
> > description of the current implementation and its issues):
> > 
> > * The current UFunc type resolution must be adapted to allow better
> > control
> >   for user-defined dtypes as well as resolve current
> > inconsistencies.
> > * The inner-loop used in UFuncs must be expanded to include a
> > return value.
> >   Further, error reporting must be improved, and passing in dtype-
> > specific
> >   information enabled.
> >   This requires the modification of the inner-loop function
> > signature and
> >   addition of new hooks called before and after the inner-loop is
> > used.
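
A very rough Python picture of the second bullet, i.e. an inner loop that reports failure through its return value, wrapped by hooks that run before and after it. The function names, the ``0``/``-1`` convention and the hook signatures are assumptions for illustration; the actual design is left to NEP 43.

```python
def add_loop(inputs, out):
    """Toy inner loop: returns 0 on success and -1 on error."""
    a, b = inputs
    for i in range(len(out)):
        try:
            out[i] = a[i] + b[i]
        except TypeError:
            return -1  # report the error instead of failing silently
    return 0


def call_with_hooks(loop, inputs, out, before=None, after=None):
    """Run ``loop`` between optional hooks and check its status."""
    if before is not None:
        before()
    status = loop(inputs, out)
    if after is not None:
        after(status)
    if status != 0:
        raise RuntimeError("inner loop reported an error")
    return out


events = []
result = call_with_hooks(
    add_loop,
    ([1, 2], [3, 4]),
    [0, 0],
    before=lambda: events.append("before"),
    after=lambda status: events.append(("after", status)),
)
print(result, events)
```

The caller, not the loop, decides how a non-zero status turns into an exception, which is roughly what a return value makes possible.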
> > 
> > An important goal for any changes to the universal functions will
> > be to
> > allow the reuse of existing loops.
> > It should be easy for a new units datatype to fall back to existing
> > math
> > functions after handling the unit related computations.
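
To illustrate the reuse goal, here is a small sketch of a units type that handles the unit bookkeeping itself and then falls back to an existing numeric loop. ``Quantity``, ``multiply_units`` and ``float_multiply_loop`` are hypothetical names, and the plain list loop merely stands in for an existing float64 inner loop.

```python
def float_multiply_loop(a, b):
    """Stand-in for an existing float multiply loop."""
    return [x * y for x, y in zip(a, b)]


class Quantity:
    """Toy values-with-unit container; unit maps name -> exponent."""

    def __init__(self, values, unit):
        self.values = values
        self.unit = unit


def multiply_units(u1, u2):
    """Combine two units by adding exponents, dropping zero powers."""
    result = dict(u1)
    for name, power in u2.items():
        result[name] = result.get(name, 0) + power
    return {name: power for name, power in result.items() if power != 0}


def unit_multiply(q1, q2):
    # Step 1: unit-related computation, specific to the units type.
    unit = multiply_units(q1.unit, q2.unit)
    # Step 2: reuse the existing numeric loop for the actual math.
    values = float_multiply_loop(q1.values, q2.values)
    return Quantity(values, unit)


speed = unit_multiply(
    Quantity([2.0, 3.0], {"m": 1}),
    Quantity([4.0, 5.0], {"s": -1}),
)
print(speed.values, speed.unit)
```

Only the first step is specific to the new datatype; the numeric work is delegated unchanged.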
> > 
> > 
> > Discussion
> > ----------
> > 
> > See NEP 40 for a list of previous meetings and discussions.
> > 
> > 
> > References
> > ----------
> > 
> > .. [pandas_extension_arrays]
> > https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
> > 
> > .. _xarray_dtype_issue: 
> > https://github.com/pydata/xarray/issues/1262
> > 
> > .. [pygeos] https://github.com/caspervdw/pygeos
> > 
> > .. [new_sort] https://github.com/numpy/numpy/pull/12945
> > 
> > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
> > 
> > .. [PR 15508] https://github.com/numpy/numpy/pull/15508
> > 
> > 
> > Copyright
> > ---------
> > 
> > This document has been placed in the public domain.
> > 
> > 
> > Acknowledgments
> > ---------------
> > 
> > The effort to create new datatypes for NumPy has been discussed for
> > several
> > years in many different contexts and settings, making it impossible
> > to
> > list everyone involved.
> > We would like to thank especially Stephan Hoyer, Nathaniel Smith,
> > and Eric
> > Wieser
> > for repeated in-depth discussion about datatype design.
> > We are very grateful for the community input in reviewing and
> > revising this
> > NEP and would like to thank especially Ross Barnowski and Ralf
> > Gommers.
> > 
> > _______________________________________________
> > NumPy-Discussion mailing list
> > [email protected]
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> > 


