I think dtype-based casting makes a lot of sense, the problem is
backward compatibility.

Numpy casting is weird in a number of ways: the array + array casting is
unexpected to many users (e.g., uint64 + int64 -> float64), and the
casting of array + scalar is different from that, and value based.
Personally, I wouldn't want to try to change it unless we make a
backward-incompatible release (numpy 2.0), based on my experience trying
to change much more minor things. We already put "casting" on the list
of desired backward-incompatible changes here:
https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
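
For concreteness, a short sketch of both behaviors (the array + array
result is stable, but the array + scalar results shown are from the
legacy value-based rules and may differ in newer NumPy versions):

```python
import numpy as np

# array + array: dtype-based, but surprising -- no integer dtype can
# hold both the uint64 and the int64 range, so NumPy picks float64.
a = np.array([1], dtype=np.uint64)
b = np.array([1], dtype=np.int64)
print((a + b).dtype)  # float64

# array + scalar: value based under the legacy rules -- the *value* of
# the Python int decides the result dtype.
c = np.array([1, 2], dtype=np.int8)
print((c + 1).dtype)  # int8 (1 fits in int8)

try:
    # int16 under the legacy value-based rules; newer NumPy instead
    # rejects Python ints that do not fit the array's dtype.
    print((c + 1000).dtype)
except OverflowError:
    print("OverflowError")
```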

Relatedly, I've previously dreamed about a different "C-style" way
casting might behave:
https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76

The proposal there is that array + array casting, array + scalar, and
array + python casting would all work in the same dtype-based way, which
mimics the familiar "C" casting rules.

See also:
https://github.com/numpy/numpy/issues/12525

Allan


On 6/5/19 4:41 PM, Sebastian Berg wrote:
> Hi all,
> 
> TL;DR:
> 
> Value based promotion seems complex both for users and for ufunc
> dispatching/promotion logic. Is there any way we can move forward here,
> and if we do, could we just risk breaking some possible (maybe
> non-existing) corner cases early to get on the way?
> 
> -----------
> 
> Currently when you write code such as:
> 
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
> 
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for all unary functions (and most
> others as well), the output will have a `res.dtype` of uint16.
> 
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
> 
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
> 
> Currently, this value based logic is enforced by checking whether the
> cast is possible: a value such as "4" can be safely cast to int8 or
> uint8. So the first call above will at some point check whether
> "uint16 + uint16 -> uint16" is a valid operation, find that it is, and
> thus stop searching. (There is additional logic: when both/all
> operands are scalars, value based casting is not applied.)
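
The check can be sketched with `np.min_scalar_type` and `np.can_cast`
(a sketch of the idea, not the exact internal code path):

```python
import numpy as np

# The value 1 is first reduced to the smallest dtype that can hold it...
small = np.min_scalar_type(1)
print(small)  # uint8

# ...and the loop "uint16 + uint16 -> uint16" is accepted because that
# small dtype can be safely cast to uint16.
print(np.can_cast(small, np.uint16))  # True

# A value that does not fit in 16 bits would force a different loop:
print(np.min_scalar_type(100000))  # uint32
```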
> 
> Note that this is defined in terms of casting: casting "1" to uint16
> is safely possible even though 1 may be typed as int64. This logic
> thus affects all promotion rules as well (i.e. what the output dtype
> should be).
> 
> 
> There are 2 main discussion points/issues about it:
> 
> 1. Should value based casting/promotion logic exist at all?
> 
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it? It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get), there is not much option but to
> handle them identically. However, it seems pretty odd that:
>  * `np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)`
>  * `np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)`
> 
> give a different result.
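
To make the oddity concrete (the 0-D result shown is from the legacy
value-based rules; newer NumPy versions may promote both the same way):

```python
import numpy as np

zero_d = np.array(3, dtype=np.int32)    # 0-D: treated like a scalar
one_d = np.array([3], dtype=np.int32)   # 1-D: treated like an array
other = np.arange(10, dtype=np.int8)

# Under the legacy rules the 0-D operand's *value* (3 fits in int8)
# decides the dtype, so the two results differ:
print((zero_d + other).dtype)  # int8 under legacy rules
print((one_d + other).dtype)   # int32 either way
```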
> 
> This is a bit different for python scalars, which do not have a type
> attached already.
> 
> 
> 2. Promotion and type resolution in Ufuncs:
> 
> What is currently bothering me is that the decision what the output
> dtypes should be currently depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
> 
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
> 
> Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot,
> it is not as simple as finding the "minimal" dtype once and working
> with that.
> Of course Eric and I discussed this a bit before, and you could create
> an internal "uint7" dtype whose only purpose is to flag that a
> cast to int8 is safe.
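
The ambiguity can be seen with `np.min_scalar_type`, which has to pick
one dtype and thereby loses the "also safe as int8" information; that
lost bit is exactly what a hypothetical "uint7" flag would preserve:

```python
import numpy as np

# 127 fits in both int8 and uint8, but min_scalar_type must pick one:
print(np.min_scalar_type(127))  # uint8
print(np.min_scalar_type(-1))   # int8 (a negative value forces signed)

# Having picked uint8, a later promotion with int8 overshoots to int16,
# even though the original value 127 was representable in int8:
print(np.promote_types(np.uint8, np.int8))  # int16
```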
> 
> I suppose it is possible I am barking up the wrong tree here, and this
> caching/predictability is not vital (or can be solved easily with such
> an internal dtype, although that does not seem elegant to me).
> 
> 
> Possible options to move forward
> --------------------------------
> 
> I still have to see a bit how tricky things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>   * The uint7 idea would be one solution
>   * Simply implement something that works for numpy and for all but
>     strange external ufuncs (I can only think of numba as a plausible
>     candidate for creating such).
> 
> My current plan is to see where the second thing leaves me.
> 
> We should also see if we cannot move the whole thing forward, in which
> case the main decision would be where to move forward to. My current
> opinion is that when a type clearly has a dtype associated with it, we
> should always use that dtype in the future. This mostly means that
> numpy dtypes such as `np.int64` will always be treated like an int64,
> and never like a `uint8` just because they happen to be castable to it.
> 
> For values without a dtype attached (read python integers, floats), I
> see three options, from more complex to simpler:
> 
> 1. Keep the current logic in place as much as possible
> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
>    The upside is that it limits the complexity to a much simpler
>    problem, the downside is that the ufunc call and operator match
>    less clearly.
> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.
> 
> The downside of 1. is that it doesn't help with simplifying the current
> situation all that much, because we still have the special casting
> around...
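
Option 3 would make promotion purely dtype-based. A sketch of what that
means in practice, assuming a Python int maps to the default integer
(spelled int64 here):

```python
import numpy as np

# Purely dtype-based promotion: the result dtype follows from the
# operand dtypes alone, never from the values.
print(np.result_type(np.int8, np.int64))      # int64
print(np.result_type(np.float32, np.float64)) # float64

# Under option 3, `arr + 1` would behave like `arr + np.int64(1)` and
# give int64; users wanting int8 would type the scalar explicitly:
arr = np.array([1, 2], dtype=np.int8)
print((arr + np.int8(1)).dtype)  # int8
```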
> 
> 
> I have realized that this got much too long, so I hope it makes sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...
> 
> Best,
> 
> Sebastian
> 
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
