Hi all,

In https://github.com/numpy/numpy/issues/18407 it was reported that there is a regression for `np.array()` and friends in NumPy 1.20 for code such as:

    np.array(["1234"], dtype=("U1", 4))
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')


The Basics
----------

This happens when you ask for a rare "subarray" dtype. Ways to create one are:

    np.dtype(("U1", 4))
    np.dtype("(4)U1,")  # (does not have a field, only a subarray)

Both give the same subarray dtype: a "U1" dtype with shape 4. One thing to know about these dtypes is that they cannot be attached to an array:

    np.zeros(3, dtype="(4)U1,").dtype == "U1"
    np.zeros(3, dtype="(4)U1,").shape == (3, 4)

I.e. the shape is moved/added into the array itself (instead of remaining part of the dtype).


The Change
----------

Now what changed, and why? When filling subarray dtypes, NumPy normally fills every element with the same input. For the example above, NumPy will therefore in most cases give the 1.20 result, because it assigns "1234" to every subarray element individually. Maybe confusingly, this truncates so that only the "1" is actually assigned. We can prove it with a structured dtype (same result in 1.19 and 1.20):

    >>> np.array(["1234"], dtype="(4)U1,i")
    array([(['1', '1', '1', '1'], 1234)],
          dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])

Another, weirder case which changed (more obviously for the better) is:

    >>> np.array("1234", dtype="(4)U1,")
    # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
    # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')

And, to point it out, we can have subarrays that are not 1-D:

    >>> np.array(["12"], dtype=("(2,2)U1,"))
    array([[['1', '1'],
            ['2', '2']]], dtype='<U1')
    # NumPy 1.19; with 1.20 all is '1'


The Cause
---------

The cause of the 1.19 behaviour is two-fold:

1. The "subarray" part of the dtype is moved into the array after the dimension is found. At this point strings are always considered "scalars". In most of the above examples, the new array shape is (1,)+(4,).

2. When filling the new array with values, it now has an _additional_ dimension! Because of this, the string is now suddenly considered a sequence, so it behaves the same as `list("1234")`, although normally NumPy would never consider a string a sequence.


The Solution?
-------------

I honestly don't have one. We can consider strings as sequences in this weird special case. That will probably create other weird special cases, but they would be even more hidden (I expect mainly odder things throwing an error).

Should we try to document this better in the release notes or can we think of some better (or at least louder) solution?

Cheers,

Sebastian
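
For anyone who wants to poke at this locally, here is a minimal sketch of the two points from "The Basics" and "The Cause". It only uses `np.dtype(("U1", 4))`, `dtype.subdtype` and `np.zeros`. The last part, splitting the string with `list(...)` and using a plain "U1" dtype, is an assumption about what the 1.19 behaviour effectively amounted to, not a fix proposed in this thread:

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,).
sub = np.dtype(("U1", 4))
base, shape = sub.subdtype

# The subarray shape is absorbed into the array; only the base dtype remains.
arr = np.zeros(3, dtype=sub)
print(arr.dtype, arr.shape)

# Assumption: spelling out the characters explicitly and using the plain
# base dtype does not depend on whether a string is treated as a sequence.
explicit = np.array([list("1234")], dtype="U1")
print(explicit.shape, explicit.tolist())
```

Since no subarray dtype is involved in the last call, it should not be affected by the change discussed above.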
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion