Re: [Numpy-discussion] Setting custom dtypes and 1.14

josef . pktd Tue, 30 Jan 2018 11:49:55 -0800

On Tue, Jan 30, 2018 at 2:42 PM, <josef.p...@gmail.com> wrote:

>
>
> On Tue, Jan 30, 2018 at 1:33 PM, <josef.p...@gmail.com> wrote:
>
>>
>>
>> On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <allanhald...@gmail.com>
>> wrote:
>>
>>> On 01/29/2018 11:50 PM, josef.p...@gmail.com wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhald...@gmail.com>
>>>> wrote:
>>>>
>>>>     On 01/29/2018 05:59 PM, josef.p...@gmail.com wrote:
>>>>
>>>>
>>>>
>>>>         On Mon, Jan 29, 2018 at 5:50 PM, <josef.p...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>              On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane
>>>>              <allanhald...@gmail.com> wrote:
>>>>
>>>>                  On 01/29/2018 04:02 PM, josef.p...@gmail.com wrote:
>>>>                  >
>>>>                  >
>>>>                  > On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root
>>>>                  > <ben.v.r...@gmail.com> wrote:
>>>>                  >
>>>>                  >     I <3 structured arrays. I love the fact that I can
>>>>                  >     access data by row and then by fieldname, or vice
>>>>                  >     versa. There are times when I need to pass just a
>>>>                  >     column into a function, and there are times when I
>>>>                  >     need to process things row by row. Yes, pandas is
>>>>                  >     nice if you want the specialized indexing features,
>>>>                  >     but it becomes a bear to deal with if all you want
>>>>                  >     is normal indexing, or even the ability to easily
>>>>                  >     loop over the dataset.
>>>>                  >
>>>>                  >
>>>>                  > I don't think there is a doubt that structured arrays,
>>>>                  > arrays with structured dtypes, are a useful container.
>>>>                  > The question is whether they should be more or the
>>>>                  > foundation for more.
>>>>                  >
>>>>                  > For example, computing a mean, or a reduce operation,
>>>>                  > over numeric elements ("columns"). Before padded views
>>>>                  > it was possible to index by selecting the relevant
>>>>                  > "columns" and view them as a standard array. With
>>>>                  > padded views that breaks and AFAICS, there is no way in
>>>>                  > numpy 1.14.0 to compute a mean of some "columns". (I
>>>>                  > don't have numpy 1.14 to try or find a workaround, like
>>>>                  > maybe looping over all relevant columns.)
>>>>                  >
>>>>                  > Josef
>>>>
>>>>                  Just to clarify, structured types have always had
>>>>                  padding bytes, that isn't new.
>>>>
>>>>                  What *is* new (which we are pushing to 1.15, I think) is
>>>>                  that it may be somewhat more common to end up with
>>>>                  padding than before, and only if you are specifically
>>>>                  using multi-field indexing, which is a fairly specialized
>>>>                  case.
>>>>
>>>>                  I think recfunctions already account properly for padding
>>>>                  bytes. Except for the bug in #8100, which we will fix,
>>>>                  padding-bytes in recarrays are more or less invisible to a
>>>>                  non-expert who only cares about dataframe-like behavior.
>>>>
>>>>                  In other words, padding is no obstacle at all to
>>>>                  computing a mean over a column, and single-field indexes
>>>>                  in 1.15 behave identically as before. The only thing that
>>>>                  will change in 1.15 is multi-field indexing, and it has
>>>>                  never been possible to compute a mean (or any binary
>>>>                  operation) on multiple fields.
>>>>
>>>>
>>>>              from the example in the other thread
>>>>              a[['b', 'c']].view(('f8', 2)).mean(0)
>>>>
>>>>
>>>>              (from the statsmodels usecase:
>>>>              read csv with genfromtxt to get a recarray or structured array
>>>>              select/index the numeric columns
>>>>              view them as standard array
>>>>              do whatever we can do with standard numpy  arrays
>>>>              )
>>>>
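A minimal sketch of the workflow listed above, written so that it does not rely on the memory layout of the structured array. The CSV contents and the field names (`name`, `b`, `c`) are made up for illustration, and the field-by-field `np.column_stack` is only one possible replacement for the explicit view, not something proposed in this thread.

```python
import io

import numpy as np

# Made-up CSV data: one string column and two numeric columns.
csv_text = "name,b,c\nx,1.0,2.0\ny,3.0,4.0\n"

# Read the CSV into a structured array; dtype=None lets genfromtxt
# infer a dtype per column, names=True takes field names from the header.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

# Select the numeric fields by dtype kind (signed/unsigned int, float).
numeric = [n for n in data.dtype.names if data.dtype[n].kind in "iuf"]

# Build a plain 2-D array one field at a time. This copies the data,
# but it works regardless of padding bytes between the fields.
plain = np.column_stack([data[n] for n in numeric])

col_means = plain.mean(0)
```

The price of this approach is a copy; the view-based idiom avoided the copy but only worked when the selected fields were adjacent in memory and shared one dtype.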
>>>>
>>>>     Oh ok, I misunderstood. I see your point: a mean over fields is more
>>>>     difficult than before.
>>>>
>>>>         Or, to phrase it as a question:
>>>>
>>>>         How do we get a standard array with homogeneous dtype from the
>>>>         corresponding elements of a structured dtype in numpy 1.14.0?
>>>>
>>>>         Josef
>>>>
>>>>
>>>>     The answer may be that "numpy has never had a way to do that",
>>>>     even if in a few special cases you might hack a workaround using
>>>>     views.
>>>>
>>>>     That's what your example seems like to me. It uses an explicit view,
>>>>     which is an "expert" feature since views depend on the exact memory
>>>>     layout and binary representation of the array. Your example only
>>>>     works if the two fields have exactly the same dtype as each other
>>>>     and as the final dtype, and evidently breaks if there is byte
>>>>     padding for any reason.
>>>>
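To make the point about memory layout concrete, here is a small sketch comparing a packed two-field dtype with a padded one. The padded dtype is built by hand with assumed offsets (as if an 8-byte field in front of `b` had been dropped); it is an illustration of the layout issue, not the exact dtype any particular numpy version returns from multi-field indexing.

```python
import numpy as np

# Packed layout: 'b' and 'c' sit next to each other, 16 bytes per item.
packed = np.dtype([("b", "f8"), ("c", "f8")])

# Hand-built padded layout (assumption for illustration): 8 unused
# bytes in front of 'b', so each item occupies 24 bytes.
padded = np.dtype({"names": ["b", "c"],
                   "formats": ["f8", "f8"],
                   "offsets": [8, 16],
                   "itemsize": 24})

arr = np.zeros(3, dtype=padded)
arr["b"] = [1.0, 2.0, 3.0]
arr["c"] = [4.0, 5.0, 6.0]

# The byte offset of each field is recorded in dtype.fields.
offset_b = padded.fields["b"][1]
offset_c = padded.fields["c"][1]

# Field access by name does not care about the padding, so a
# field-by-field stack gives the expected (3, 2) float array.
cols = np.stack([arr["b"], arr["c"]], axis=-1)
```

Because the padded item is 24 bytes wide while two `f8` values need only 16, reinterpreting the raw buffer as pairs of floats would no longer line up with the fields.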
>>>>     Pandas can do row means without these problems:
>>>>
>>>>          >>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)
>>>>
>>>>     Numpy is missing this functionality, so you or whoever wrote that
>>>>     example figured out a fragile workaround using views.
>>>>
>>>>
>>>> Once upon a time (*) this wasn't fragile but the only and recommended
>>>> way. Because dtypes were low level with clear memory layout and stayed that
>>>> way, it was easy to check item size or whatever and get different views on
>>>> it.
>>>> e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html
>>>>
>>>> (*) pre-pandas, pre-stackoverflow on the mailing lists which was for me
>>>> roughly 2008 to 2012
>>>> but a late thread https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html
>>>> "What is now the recommended way of converting structured
>>>> dtypes/recarrays to ndarrays?"
>>>>
>>>>
> one final historical note (once upon a time users relied on cookbooks)
> http://scipy-cookbook.readthedocs.io/items/Recarray.html#Converting-to-regular-arrays-and-reshaping
> 2010-03-09 (last modified), 2008-06-27 (created)
> which I assume is broken in numpy 1.14.0
>


and a final grumpy note

https://docs.scipy.org/doc/numpy-1.14.0/release.html#multiple-field-indexing-assignment-of-structured-arrays

"which will affect code such as" = "which will break your code without
offering an alternative"


Josef
<back to regular scheduled topics>



>
>
>
>>
>>>>
>>>>
>>>>     I suggest that if we want to allow either means over fields, or
>>>>     conversion of a n-D structured array to an n+1-D regular ndarray, we
>>>>     should add a dedicated function to do so in numpy.lib.recfunctions
>>>>     which does not depend on the binary representation of the array.
>>>>
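A rough sketch of what such a dedicated function could look like. The name `fields_to_ndarray` and its signature are hypothetical, chosen only for this illustration; this is not an existing `numpy.lib.recfunctions` function. It converts an n-D structured array into an (n+1)-D regular array by reading each field by name, so it does not depend on padding or field order in memory.

```python
import numpy as np


def fields_to_ndarray(arr, names=None):
    """Hypothetical helper: stack the selected fields of a structured
    array along a new last axis, casting them to a common dtype."""
    names = list(arr.dtype.names) if names is None else list(names)
    # Pick a dtype that can hold every selected field (e.g. i8 + f8 -> f8).
    common = np.result_type(*[arr.dtype[n] for n in names])
    # Each arr[n] has the shape of arr; stacking adds one trailing axis.
    return np.stack([arr[n].astype(common) for n in names], axis=-1)


# Same mixed dtype as in the pandas example above, on a 2-D array.
a = np.ones((2, 5), dtype="i8,f8")
out = fields_to_ndarray(a)

# A mean over fields is then an ordinary reduction over the last axis.
field_means = out.mean(axis=-1)
```

Unlike the explicit view, this works when the fields have different dtypes, at the cost of always making a copy.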
>>>>
>>>> I don't really want to defend an obsolete (?) usecase of structured
>>>> dtypes.
>>>>
>>>> However, I think there should be a decision about whether dataframe-like
>>>> usage of structured dtypes, directly or through higher-level classes or
>>>> functions, is still supported, instead of slowly and silently (*) removing
>>>> the foundation for this use case. Either support this usage or say you
>>>> will be dropping it.
>>>>
>>>> (*) I didn't read the details of the release notes
>>>>
>>>>
>>>> And another footnote about obsolete:
>>>> Given that I'm the only one arguing about the dataframe_like usecase of
>>>> recarrays and structured dtypes, I think they are dead for this specific
>>>> usecase and only my inertia and conservativeness kept them alive in
>>>> statsmodels.
>>>>
>>>>
>>>> Josef
>>>>
>>>
>>> It's a bit of a stretch to say that we are "silently" dropping support
>>> for dataframe-like use of structured arrays.
>>>
>>> First, we still allow pretty much all dataframe-like use we have
>>> supported since numpy 1.7, limited as it may be. We are really only
>>> dropping one very specialized, expert use involving an explicit view, which
>>> I still have doubts was ever more than a hack. That 2008 mailing list
>>> message didn't involve multi-field indexing, which didn't exist then (only
>>> introduced in 2009), and we have wanted to make them views (not copies)
>>> since their inception.
>>>
>>
>> The 2008 mailing list thread introduced me to working with views on
>> structured arrays as the ONLY way to switch between structured and
>> homogeneous dtypes (if the underlying item size was homogeneous).
>> The new statsmodels started in 2009.
>>
>>
>>>
>>> Second, I don't think we are doing so silently: We have warned about
>>> this in release notes since numpy 1.7 in 2012/2013, and it has been
>>> mentioned in most releases since then. We have also raised FutureWarnings
>>> about it since 1.7. Unfortunately we missed warning in your specific case
>>> for a while, but we corrected this in 1.12 so you should have seen
>>> FutureWarnings since then.
>>>
>>
>> If I see warnings in the test suite about getting a view instead of a copy
>> from numpy, then the only/main consequence I think about is whether I need
>> to watch out for in-place modification.
>> I didn't expect that the followup computation would change, and that it's
>> a padded view and not a view on the selected memory. However, I just
>> checked and padding is mentioned in the 1.12 release notes (which I had
>> never read before).
>>
>> AFAICS, one problem is that the padded view didn't come with matching
>> downstream support: there is no pack function (as mentioned), no
>> alternative way to convert to a standard ndarray, copy doesn't get rid of
>> the padding, and so on.
>>
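For illustration, a sketch of what a "pack" step could do: copy a padded structured array into a freshly built dtype without padding, after which the old view idiom lines up again. `pack_fields` is a hypothetical name used only here, and the padded input dtype is constructed by hand with assumed offsets.

```python
import numpy as np


def pack_fields(arr):
    """Hypothetical helper: copy arr into a dtype with the same fields
    but no padding bytes between or around them."""
    packed_dtype = np.dtype([(name, arr.dtype.fields[name][0])
                             for name in arr.dtype.names])
    out = np.empty(arr.shape, dtype=packed_dtype)
    for name in arr.dtype.names:
        out[name] = arr[name]  # copy field by field, by name
    return out


# Hand-built padded dtype (assumption): 8 unused bytes before 'b'.
padded = np.dtype({"names": ["b", "c"],
                   "formats": ["f8", "f8"],
                   "offsets": [8, 16],
                   "itemsize": 24})
arr = np.zeros(3, dtype=padded)
arr["b"] = [1.0, 2.0, 3.0]
arr["c"] = [4.0, 5.0, 6.0]

packed = pack_fields(arr)

# With the padding gone, the buffer is just pairs of float64 values,
# so it can be reinterpreted as a plain (3, 2) array.
flat = packed.view("f8").reshape(-1, 2)
```

The view on `packed` is only safe here because both fields are `f8`; with mixed field dtypes a field-by-field conversion would still be needed.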
>> e.g. another mailing list thread I just found with the same problem
>> http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
>>
>> quoting Ralf:
>> Question: is that really the recommended way to get an (N, 2) size float
>> array from two columns of a larger record array? If so, why isn't there a
>> better way? If you'd want to write to that (N, 2) array you have to append
>> a copy, making it even uglier. Also, then there really should be tests for
>> views in test_records.py.
>>
>>
>> This "better way" never showed up, AFAIK. And it looks like we came back
>> to this problem every few years.
>>
>> Josef
>>
>>
>>>
>>> I don't feel the need to officially declare that we are dropping support
>>> for dataframe-like use of structured arrays. It's unclear where that use
>>> ends and other uses of structured arrays begin. I think updating the docs
>>> to warn that pandas/dask may be a better choice is enough, as I've been
>>> doing, and then users can decide for themselves.
>>
>>
>>> There is still the question about whether we should make
>>> numpy.lib.recfunctions more official. I don't have a strong opinion. I
>>> suppose it would be good to add a section to the structured array docs
>>> which lists those methods and says something like
>>>
>>> "the submodule numpy.lib.recfunctions provides minimal functionality to
>>> split, combine, and manipulate structured datatypes and arrays. In most
>>> cases, we strongly recommend users use a dedicated module such as
>>> pandas/xarray/dask instead of these methods, but they are provided for
>>> occasional convenience."
>>>
>>> Allan
>>>
>>>
>>>
>>>     Allan
>>>>
>>>>
>>>>              Josef
>>>>
>>>>
>>>>                  Allan
>>>>
>>>>                  >
>>>>                  >     Cheers!
>>>>                  >     Ben Root
>>>>                  >
>>>>                  >     On Mon, Jan 29, 2018 at 3:24 PM,
>>>>                  >     <josef.p...@gmail.com> wrote:
>>>>                  >
>>>>                  >
>>>>                  >
>>>>                  >         On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt
>>>>                  >         <stef...@berkeley.edu> wrote:
>>>>                  >
>>>>                  >             On Mon, 29 Jan 2018 14:10:56 -0500,
>>>>                  >             josef.p...@gmail.com wrote:
>>>>                   >
>>>>                   >                 Given that there is pandas, xarray, dask
>>>>                   >                 and more, numpy could as well drop any
>>>>                   >                 pretense of supporting dataframe_likes.
>>>>                   >                 Or, adjust the recfunctions so we can
>>>>                   >                 still work dataframe_like with structured
>>>>                   >                 dtypes/recarrays/recfunctions.
>>>>                   >
>>>>                   >
>>>>                   >             I haven't been following the duckarray
>>>>                   >             discussion carefully, but could this be an
>>>>                   >             opportunity for a dataframe protocol, so that
>>>>                   >             we can have libraries ingest structured
>>>>                   >             arrays, record arrays, pandas dataframes,
>>>>                   >             etc. without too much specialized code?
>>>>                   >
>>>>                   >
>>>>                   >         AFAIU while not being in the data handling area,
>>>>                   >         pandas defines the interface and other libraries
>>>>                   >         provide pandas compatible interfaces or
>>>>                   >         implementations.
>>>>                   >
>>>>                   >         statsmodels currently still has recarray support
>>>>                   >         and usage. In some interfaces we support pandas,
>>>>                   >         recarrays and plain arrays, or anything where
>>>>                   >         asarray works correctly.
>>>>                   >
>>>>                   >         But recarrays became messy to support, one
>>>>                   >         rewrite of some functions last year converts
>>>>                   >         recarrays to pandas, does the manipulation and
>>>>                   >         then converts back to recarrays. Also we need to
>>>>                   >         adjust our recarray usage with new numpy
>>>>                   >         versions. But there is no real benefit because I
>>>>                   >         doubt that statsmodels still has any
>>>>                   >         recarray/structured dtype users. So, we only
>>>>                   >         have to remove our own uses in the datasets and
>>>>                   >         unit tests.
>>>>                   >
>>>>                   >         Josef
>>>>                   >
>>>>                   >
>>>>                   >
>>>>                   >
>>>>                   >             Stéfan
>>>>                   >
>>>>                   >             _______________________________________________
>>>>                   >             NumPy-Discussion mailing list
>>>>                   >             NumPy-Discussion@python.org
>>>>                   >             https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
