Re: [Numpy-discussion] Multiple-field indexing: view vs copy in 1.14+

josef . pktd Mon, 22 Jan 2018 08:11:08 -0800

On Mon, Jan 22, 2018 at 10:53 AM, <[email protected]> wrote:

>
>
> On Sun, Jan 21, 2018 at 9:48 PM, Allan Haldane <[email protected]>
> wrote:
>
>> Hello all,
>>
>> We are making a decision (again) about what to do about the
>> behavior of multiple-field indexing of structured arrays: Should
>> it return a view or a copy, and on what release schedule?
>>
>> As a reminder, this refers to operations like (1.13 behavior):
>>
>>     >>> a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'f4')])
>>     >>> a[['a', 'c']]
>>     array([(0, 0.), (0, 0.), (0, 0.)],
>>           dtype=[('a', '<i4'), ('c', '<f4')])
>>
>> In numpy 1.14.0 we made this return a view instead of a copy, but
>> downstream test failures suggest we reconsider. In our current
>> implementation for 1.14.1, we have reverted this change, but
>> still plan to go through with it in 1.15.
>>
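
A version-independent way to see which behavior an installed NumPy has
is to check whether the multi-field result overlaps the original buffer
and whether its itemsize still spans the skipped field. This is only a
sketch; the variable names are illustrative and not from the PR.

```python
import numpy as np

a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'f4')])
sub = a[['a', 'c']]

# A view overlaps the parent's buffer; a copy owns separate memory.
is_view = np.may_share_memory(a, sub)

# A view keeps the parent's 12-byte itemsize (field 'b' becomes padding);
# a packed copy has itemsize 4 + 4 = 8.
print("view" if is_view else "copy", sub.dtype.itemsize)
```

Which of the two lines is printed depends on the NumPy release in use.
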
>> See here for our discussion of the problem and solutions:
>> https://github.com/numpy/numpy/pull/10411
>>
>> The two main options we have discussed are either to try to make
>> the change in 1.15, or never make the change at all and always
>> return a copy.
>>
>> Here are some pros and cons:
>>
>> Pros (change to view in 1.15)
>> =============================
>>
>>  * Views are useful and convenient. Other forms of indexing also
>>    often return views so this is more consistent.
>>  * This change has been planned since numpy 1.7 in 2009,
>>    and there have been visible FutureWarnings about it since
>>    then. Anyone whose code will break should have seen the
>>    warnings. It has been extensively warned about in recent
>>    release notes.
>>  * Past discussions have supported the change. See my comment in
>>    the PR with many links to them and to other history.
>>  * Users have requested the change on the list.
>>  * Possibly a majority of the reported code failures were not
>>    actually caused by the change, but by another bug (#8100)
>>    involving np.load/np.save which this change exposed. If we
>>    push it off to 1.15, we will have time to fix this other bug.
>>    (There were no FutureWarnings for this breakage, of course).
>>  * The code that really will break is of the form
>>          a[['a', 'c']].view('i8')
>>    because the returned itemsize is different. This has
>>    raised FutureWarnings since numpy 1.7, and no users reported
>>    failures due to this change. In the PR we still try to
>>    mitigate this breakage by introducing a new method
>>    `pack_fields`, which converts the result into the 1.13 form,
>>    so that
>>          np.pack_fields(a[['a', 'c']]).view('i8')
>>    will work.
>>
>>
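
The `pack_fields` name above is the one proposed in the PR. As an
assumption that goes beyond this thread: later NumPy releases provide
`numpy.lib.recfunctions.repack_fields`, which appears to fill the same
role. A minimal sketch of the mitigation using that function:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'f4')])
sub = a[['a', 'c']]

# repack_fields removes any padding left by the skipped field 'b',
# so the result has the packed 8-byte layout in either behavior.
packed = rfn.repack_fields(sub)

# With an 8-byte itemsize, reinterpreting each record as one int64 works.
as_i8 = packed.view('i8')
print(packed.dtype.itemsize, as_i8.shape)
```

If the indexing result is already packed (the copy behavior), the call
should leave the layout unchanged, so the same code works either way.
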
>> Cons (keep returning a copy)
>> ============================
>>
>>  * The extra convenience is not really that much, and fancy
>>    indexing also returns a copy instead of a view, so there is
>>    a precedent there.
>>  * We want to minimize compatibility breaks with old behavior.
>>    We've had a fair amount of discussion and complaints about
>>    how we break things in general.
>>  * We have lived with a "copy" for 8 years now. At some point the
>>    behavior gets set in stone for compatibility reasons.
>>  * Users have written to the list and github about their code
>>    breaking in 1.14.0. As far as I am aware, they all refer
>>    to the #8100 problem.
>>  * If a new function `pack_fields` is needed to guard against
>>    mishaps with the view behavior, that seems like a sign that
>>    keeping the copy behavior is the best option from an API
>>    perspective.
>>
>> My initial vote is go with the change in 1.15: The "view" code
>> that will ultimately break (not the code related to #8100) has
>> been sending FutureWarnings for many years, and I am not aware of
>> any user complaints involving it: All the complaints so far
>> would be fixed with #8100 in 1.15.
>>
>>
> (Note: based on a linked mailing list thread, 2012 might be the last time
> I looked more closely at structured dtypes, so some of what I understand
> might be outdated.)
>
>
> Views on structured dtypes are very important, but viewing them as
> standard arrays with standard dtypes is the main part that I have used.
> Essentially, structured dtypes are useless for any computation, e.g. even
> a simple reduce operation. To work with them we need a standard view.
>
> I think this is the use case that fails in statsmodels (except that there
> is no test failure anymore because we switched to using pandas in the unit
> test):
>



To add a detail here:

results is a recarray created from a csv file with

    results = genfromtxt(open(filename, "rb"), delimiter=",",
                         names=True, dtype=float)

['acvar_lb','acvar_ub'] are the last two columns, so this corresponds to my
example below where AFAIU no padding is necessary to get a view.


>
>
>         cls.confint_res = cls.results[['acvar_lb','acvar_ub']].view((float, 2))
> E       ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged
>
>
> This is similar to the above example
> a[['a', 'c']].view('i8')
> but it doesn't try to combine fields.
>
> In many examples where I used structured dtypes a long time ago, I
> switched between consistent views, either as a standard array of subsets
> or as structured dtypes.
> For this usecase it wouldn't matter whether a[['a', 'c']] returns a view
> or copy, as long as we can get the second view that is consistent with the
> selected part of the memory. This would also be independent of whether
> numpy pads internally and adjusts the strides if possible or not.
>
> >>> np.__version__
> '1.11.2'
>
> >>> a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])
> >>> a
> array([(1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0),
>        (1, 1.0, 1.0)],
>       dtype=[('a', '<i8'), ('b', '<f8'), ('c', '<f8')])
>
> >>> a.mean(0)
> Traceback (most recent call last):
>   File "<pyshell#15>", line 1, in <module>
>     a.mean(0)
>   File "C:\...\python-3.4.4.amd64\lib\site-packages\numpy\core\_methods.py", line 65, in _mean
>     ret = umr_sum(arr, axis, dtype, out, keepdims)
> TypeError: cannot perform reduce with flexible type
>
> >>> a[['b', 'c']].mean(0)
> Traceback (most recent call last):
>   File "<pyshell#16>", line 1, in <module>
>     a[['b', 'c']].mean(0)
>   File "C:\...\python-3.4.4.amd64\lib\site-packages\numpy\core\_methods.py", line 65, in _mean
>     ret = umr_sum(arr, axis, dtype, out, keepdims)
> TypeError: cannot perform reduce with flexible type
>
> >>> a[['b', 'c']].view(('f8', 2)).mean(0)
> array([ 1.,  1.])
> >>> a[['b', 'c']].view(('f8', 2)).dtype
> dtype('float64')
>
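
For getting a plain float array out of selected fields without relying
on `.view` and the itemsize, one option (an assumption beyond this
thread, available in later NumPy releases) is
`numpy.lib.recfunctions.structured_to_unstructured`. A minimal sketch
with the same array as above:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

# Convert the two float fields into an ordinary (5, 2) float64 array.
# This does not depend on whether a[['b', 'c']] is a view or a copy.
plain = rfn.structured_to_unstructured(a[['b', 'c']])

# Reductions now work because the dtype is no longer a flexible type.
col_means = plain.mean(0)
print(plain.shape, plain.dtype, col_means)
```

Whether the result shares memory with `a` is a separate question; the
sketch only relies on the values and the shape.
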
>
> Aside: The plan is that statsmodels will drop all usage and support for
> recarrays/structured dtypes in the following release (0.10).
> Then structured dtypes are free (from our perspective) to provide
> low-level struct support instead of pretending to be dataframe_like.
>
> Josef
>
>
>
>> Feel free to also discuss the related proposed change, to make
>> np.diag return a view instead of a copy. That change has
>> not been implemented yet, only proposed.
>
>
>> Cheers,
>> Allan
>> _______________________________________________
>> NumPy-Discussion mailing list
>> [email protected]
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
>
