Re: [Numpy-discussion] Setting custom dtypes and 1.14

josef . pktd Tue, 30 Jan 2018 11:52:55 -0800

On Tue, Jan 30, 2018 at 1:33 PM, <[email protected]> wrote:

>
>
> On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <[email protected]>
> wrote:
>
>> On 01/29/2018 11:50 PM, [email protected] wrote:
>>
>>>
>>>
>>> On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <[email protected]> wrote:
>>>
>>>     On 01/29/2018 05:59 PM, [email protected] wrote:
>>>
>>>
>>>
>>>         On Mon, Jan 29, 2018 at 5:50 PM, <[email protected]> wrote:
>>>
>>>
>>>
>>>              On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <[email protected]> wrote:
>>>
>>>                  On 01/29/2018 04:02 PM, [email protected] wrote:
>>>                  >
>>>                  >
>>>                  > On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <[email protected]> wrote:
>>>                  >
>>>                  >     I <3 structured arrays. I love the fact that I can access data by
>>>                  >     row and then by fieldname, or vice versa. There are times when I
>>>                  >     need to pass just a column into a function, and there are times when
>>>                  >     I need to process things row by row. Yes, pandas is nice if you want
>>>                  >     the specialized indexing features, but it becomes a bear to deal
>>>                  >     with if all you want is normal indexing, or even the ability to
>>>                  >     easily loop over the dataset.
>>>                  >
>>>                  >
>>>                  > I don't think there is a doubt that structured arrays, arrays with
>>>                  > structured dtypes, are a useful container. The question is whether they
>>>                  > should be more or the foundation for more.
>>>                  >
>>>                  > For example, computing a mean, or reduce operation, over numeric element
>>>                  > ("columns"). Before padded views it was possible to index by selecting
>>>                  > the relevant "columns" and view them as standard array. With padded
>>>                  > views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute
>>>                  > a mean of some "columns". (I don't have numpy 1.14 to try or find a
>>>                  > workaround, like maybe looping over all relevant columns.)
>>>                  >
>>>                  > Josef
>>>
>>>                  Just to clarify, structured types have always had padding bytes,
>>>                  that isn't new.
>>>
>>>                  What *is* new (which we are pushing to 1.15, I think) is that it
>>>                  may be somewhat more common to end up with padding than before, and
>>>                  only if you are specifically using multi-field indexing, which is a
>>>                  fairly specialized case.
>>>
>>>                  I think recfunctions already account properly for padding bytes.
>>>                  Except for the bug in #8100, which we will fix, padding-bytes in
>>>                  recarrays are more or less invisible to a non-expert who only cares
>>>                  about dataframe-like behavior.
>>>
>>>                  In other words, padding is no obstacle at all to computing a
>>>                  mean over a column, and single-field indexes in 1.15 behave
>>>                  identically as before. The only thing that will change in 1.15 is
>>>                  multi-field indexing, and it has never been possible to compute a
>>>                  mean (or any binary operation) on multiple fields.
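
To make the single-field case concrete, here is a minimal sketch. The array `a`, its values, and the helper name `column_means` are made up for illustration; only the field names 'b' and 'c' are borrowed from the example in the other thread. Each single-field index is a plain ndarray, so a per-column mean can be had by looping over the field names:

```python
import numpy as np

# Illustrative array; the dtype and values are assumptions, not from the thread.
a = np.zeros(4, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'f8')])
a['b'] = [1.0, 2.0, 3.0, 4.0]
a['c'] = [10.0, 20.0, 30.0, 40.0]

# Single-field indexing gives a regular float64 array that is a view on `a`.
col_b = a['b']

# Loop over the numeric fields, one reduction per field.
column_means = {name: a[name].mean() for name in ('b', 'c')}
print(column_means)
```

This sidesteps multi-field indexing entirely, at the cost of one reduction per field rather than one reduction over a 2-D array.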
>>>
>>>
>>>              from the example in the other thread
>>>              a[['b', 'c']].view(('f8', 2)).mean(0)
>>>
>>>
>>>              (from the statsmodels usecase:
>>>              read csv with genfromtxt to get recarray or structured array
>>>              select/index the numeric columns
>>>              view them as standard array
>>>              do whatever we can do with standard numpy  arrays
>>>              )
>>>
>>>
>>>     Oh ok, I misunderstood. I see your point: a mean over fields is more
>>>     difficult than before.
>>>
>>>         Or, to phrase it as a question:
>>>
>>>         How do we get a standard array with homogeneous dtype from the
>>>         corresponding elements of a structured dtype in numpy 1.14.0?
>>>
>>>         Josef
>>>
>>>
>>>     The answer may be that "numpy has never had a way to do that",
>>>     even if in a few special cases you might hack a workaround using views.
>>>
>>>     That's what your example seems like to me. It uses an explicit view,
>>>     which is an "expert" feature since views depend on the exact memory
>>>     layout and binary representation of the array. Your example only
>>>     works if the two fields have exactly the same dtype as each other
>>>     and as the final dtype, and evidently breaks if there is byte
>>>     padding for any reason.
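
A small sketch of the layout dependence described above. The two dtypes are constructed by hand for illustration; the offsets and itemsize of `padded` are assumptions chosen to mimic selecting 'b' and 'c' from a record that starts with a 4-byte field, not output taken from any particular numpy version:

```python
import numpy as np

# Packed layout: two float64 fields back to back, 16 bytes per record.
packed = np.dtype([('b', 'f8'), ('c', 'f8')])

# Same two fields, but with 4 leading padding bytes per record.
# Offsets and itemsize are illustrative assumptions.
padded = np.dtype({'names': ['b', 'c'],
                   'formats': ['f8', 'f8'],
                   'offsets': [4, 12],
                   'itemsize': 20})

print(packed.itemsize, padded.itemsize)

# The explicit view lines up only because each packed record is exactly
# two float64 values with nothing in between.
arr = np.zeros(3, dtype=packed)
arr['b'] = [1.0, 2.0, 3.0]
arr['c'] = [4.0, 5.0, 6.0]
as_2d = arr.view(('f8', 2))   # shape (3, 2), shares memory with arr
print(as_2d.mean(0))
```

With the padded dtype each record is 20 bytes, so it no longer matches the 16 bytes of `('f8', 2)` and the same view cannot be taken.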
>>>
>>>     Pandas can do row means without these problems:
>>>
>>>          >>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)
>>>
>>>     Numpy is missing this functionality, so you or whoever wrote that
>>>     example figured out a fragile workaround using views.
>>>
>>>
>>> Once upon a time (*) this wasn't fragile but the only and recommended
>>> way. Because dtypes were low level with clear memory layout and stayed that
>>> way, it was easy to check item size or whatever and get different views on
>>> it.
>>> e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html
>>>
>>> (*) pre-pandas, pre-stackoverflow on the mailing lists which was for me
>>> roughly 2008 to 2012
>>> but a late thread https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html
>>> "What is now the recommended way of converting structured
>>> dtypes/recarrays to ndarrays?"
>>>
>>>
One final historical note (once upon a time users relied on cookbooks):
http://scipy-cookbook.readthedocs.io/items/Recarray.html#Converting-to-regular-arrays-and-reshaping
2010-03-09 (last modified), 2008-06-27 (created)
which I assume is broken in numpy 1.14.0




>
>>>
>>>
>>>     I suggest that if we want to allow either means over fields, or
>>>     conversion of a n-D structured array to an n+1-D regular ndarray, we
>>>     should add a dedicated function to do so in numpy.lib.recfunctions
>>>     which does not depend on the binary representation of the array.
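
For illustration, a minimal sketch of what such a layout-independent conversion could look like. `fields_to_ndarray` is a hypothetical helper written for this note, not an existing function in numpy.lib.recfunctions, and the array `rec` is made up:

```python
import numpy as np

def fields_to_ndarray(arr, names, dtype=np.float64):
    """Hypothetical helper, not an existing numpy function.

    Copies the named fields of an n-D structured array into a regular
    (n+1)-D array, one field at a time, so the result does not depend on
    offsets or padding in the structured dtype.
    """
    out = np.empty(arr.shape + (len(names),), dtype=dtype)
    for i, name in enumerate(names):
        out[..., i] = arr[name]   # casts each field to `dtype`
    return out

# Illustrative data; field names and values are assumptions.
rec = np.zeros(3, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'f8')])
rec['a'] = [7, 8, 9]
rec['b'] = [1.0, 2.0, 3.0]
rec['c'] = [4.0, 5.0, 6.0]

x = fields_to_ndarray(rec, ['b', 'c'])
print(x.shape, x.mean(0))
```

Because it copies field by field, this also works for fields of different dtypes (here 'a' is int32), which the explicit view cannot handle. The price is that the result is always a copy, never a view.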
>>>
>>>
>>> I don't really want to defend an obsolete (?) usecase of structured
>>> dtypes.
>>>
>>> However, I think there should be a decision about the future plans for
>>> whether dataframe-like usages of structured dtypes, directly or through
>>> higher level classes or functions, are still supported. Instead of slowly
>>> and silently (*) removing the foundation for this use case, either support
>>> this usage or say you will be dropping it.
>>>
>>> (*) I didn't read the details of the release notes
>>>
>>>
>>> And another footnote about obsolete:
>>> Given that I'm the only one arguing about the dataframe_like usecase of
>>> recarrays and structured dtypes, I think they are dead for this specific
>>> usecase and only my inertia and conservativeness kept them alive in
>>> statsmodels.
>>>
>>>
>>> Josef
>>>
>>
>> It's a bit of a stretch to say that we are "silently" dropping support
>> for dataframe-like use of structured arrays.
>>
>> First, we still allow pretty much all dataframe-like use we have
>> supported since numpy 1.7, limited as it may be. We are really only
>> dropping one very specialized, expert use involving an explicit view, which
>> I still have doubts was ever more than a hack. That 2008 mailing list
>> message didn't involve multi-field indexing, which didn't exist then (only
>> introduced in 2009), and we have wanted to make them views (not copies)
>> since their inception.
>>
>
> The 2008 mailing list thread introduced me to working with views on
> structured arrays as the ONLY way to switch between structured and
> homogeneous dtypes (if the underlying item size was homogeneous).
> The new stats.models started in 2009.
>
>
>>
>> Second, I don't think we are doing so silently: We have warned about this
>> in release notes since numpy 1.7 in 2012/2013, and it has been mentioned in
>> most releases since then. We have also raised FutureWarnings about it since 1.7.
>> Unfortunately we missed warning in your specific case for a while, but we
>> corrected this in 1.12 so you should have seen FutureWarnings since then.
>>
>
> If I see warnings in the test suite about getting a view instead of a copy
> from numpy, then the only/main consequence I think about is whether I need
> to watch out for in-place modification.
> I didn't expect that the follow-up computation would change, and that it's
> a padded view and not a view on the selected memory. However, I just
> checked and padding is mentioned in the 1.12 release notes (which I never
> read before).
>
> AFAICS, one problem is that the padded view didn't come with the matching
> downstream usage support: the pack function as mentioned, an alternative
> way to convert to a standard ndarray, copy doesn't get rid of the padding,
> and so on.
>
> e.g. another mailing list thread I just found with the same problem
> http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
>
> quoting Ralf:
> Question: is that really the recommended way to get an (N, 2) size float
> array from two columns of a larger record array? If so, why isn't there a
> better way? If you'd want to write to that (N, 2) array you have to append
> a copy, making it even uglier. Also, then there really should be tests for
> views in test_records.py.
>
>
> This "better way" never showed up, AFAIK. And it looks like we came back
> to this problem every few years.
>
> Josef
>
>
>>
>> I don't feel the need to officially declare that we are dropping support
>> for dataframe-like use of structured arrays. It's unclear where that use
>> ends and other uses of structured arrays begin. I think updating the docs
>> to warn that pandas/dask may be a better choice is enough, as I've been
>> doing, and then users can decide for themselves.
>
>
>> There is still the question about whether we should make
>> numpy.lib.recfunctions more official. I don't have a strong opinion. I
>> suppose it would be good to add a section to the structured array docs
>> which lists those methods and says something like
>>
>> "the submodule numpy.lib.recfunctions provides minimal functionality to
>> split, combine, and manipulate structured datatypes and arrays. In most
>> cases, we strongly recommend users use a dedicated module such as
>> pandas/xarray/dask instead of these methods, but they are provided for
>> occasional convenience."
>>
>> Allan
>>
>>
>>
>>     Allan
>>>
>>>
>>>              Josef
>>>
>>>
>>>                  Allan
>>>
>>>                  >
>>>                  >     Cheers!
>>>                  >     Ben Root
>>>                  >
>>>                  >     On Mon, Jan 29, 2018 at 3:24 PM, <[email protected]> wrote:
>>>                  >
>>>                  >
>>>                  >
>>>                  >         On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <[email protected]> wrote:
>>>                  >
>>>                  >             On Mon, 29 Jan 2018 14:10:56 -0500, [email protected] wrote:
>>>                   >
>>>                   >                 Given that there is pandas, xarray, dask and more, numpy
>>>                   >                 could as well drop any pretense of supporting dataframe_likes.
>>>                   >                 Or, adjust the recfunctions so we can still work dataframe_like
>>>                   >                 with structured dtypes/recarrays/recfunctions.
>>>                   >
>>>                   >
>>>                   >             I haven't been following the duckarray discussion carefully,
>>>                   >             but could this be an opportunity for a dataframe protocol, so
>>>                   >             that we can have libraries ingest structured arrays, record
>>>                   >             arrays, pandas dataframes, etc. without too much specialized code?
>>>                   >
>>>                   >
>>>                   >         AFAIU while not being in the data handling area, pandas defines
>>>                   >         the interface and other libraries provide pandas compatible
>>>                   >         interfaces or implementations.
>>>                   >
>>>                   >         statsmodels currently still has recarray support and usage. In
>>>                   >         some interfaces we support pandas, recarrays and plain arrays,
>>>                   >         or anything where asarray works correctly.
>>>                   >
>>>                   >         But recarrays became messy to support, one rewrite of some
>>>                   >         functions last year converts recarrays to pandas, does the
>>>                   >         manipulation and then converts back to recarrays.
>>>                   >         Also we need to adjust our recarray usage with new numpy
>>>                   >         versions. But there is no real benefit because I doubt that
>>>                   >         statsmodels still has any recarray/structured dtype users. So,
>>>                   >         we only have to remove our own uses in the datasets and unit tests.
>>>                   >
>>>                   >         Josef
>>>                   >
>>>                   >
>>>                   >
>>>                   >
>>>                   >             Stéfan
>>>                   >
>>>                   >             _______________________________________________
>>>                   >             NumPy-Discussion mailing list
>>>                   >             [email protected]
>>>                   >             https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[email protected]
https://mail.python.org/mailman/listinfo/numpy-discussion
