[jira] [Work logged] (BEAM-12533) DeferredSeries and DeferredDataFrame should have a useful repr

ASF GitHub Bot (Jira) Tue, 29 Jun 2021 08:42:05 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-12533?focusedWorklogId=616555&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-616555
 ]


ASF GitHub Bot logged work on BEAM-12533:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Jun/21 15:41
            Start Date: 29/Jun/21 15:41
    Worklog Time Spent: 10m 
      Work Description: TheNeuralBit commented on a change in pull request 
#15089:
URL: https://github.com/apache/beam/pull/15089#discussion_r660743021



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -143,6 +143,14 @@ def wrapper(self, *args, **kwargs):
 
 
 class DeferredDataFrameOrSeries(frame_base.DeferredFrame):
+  def _render_indexes(self):

Review comment:
       > It may be better to keep things simple here and do `indexes={...}` 
even for a single index. Though TBH I am not sure why `index.name` and 
`index.names` are separate attributes (is this different between a pandas index 
and a deferred index)?
   
   This is true in pandas as well. If a DataFrame or Series has multiple 
indexes, a separate type is used 
([MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html)),
 which doesn't have the `name` attribute. The `names` attribute always works, 
though: it's just a single-element list in the non-MultiIndex case.
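
A small sketch of the `names` behavior described above, using plain pandas rather than Beam's deferred frames. The toy frame and its column names (`bazzy`, `barbar`, `value`) are assumptions chosen to mirror the session shown later in this comment.

```python
import pandas as pd

# Toy frame for illustration only; the data is made up.
df = pd.DataFrame({
    "bazzy": [1, 2, 3],
    "barbar": ["A", "B", "C"],
    "value": [0.1, 0.2, 0.3],
})

single = df.set_index("bazzy").index
multi = df.set_index(["bazzy", "barbar"]).index

# `names` is list-like for both a plain Index and a MultiIndex,
# so code that renders index names can treat the two cases uniformly.
single_names = list(single.names)
multi_names = list(multi.names)

print(single_names)  # ['bazzy']
print(multi_names)   # ['bazzy', 'barbar']
```

Reading `names` in both cases avoids branching on the index type just to find out what the index is called.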
   
   I tend to agree it would be nice to keep this simpler. I'm hesitant, though, 
since I think the single-index case is much more common, so I want it to look 
correct.
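
To make the trade-off concrete, here is a minimal sketch of rendering that special-cases a single index. The function `render_indexes` and the `<unnamed>` placeholder are hypothetical names for illustration, not necessarily what the PR implements; it takes a plain list of names so it runs without pandas or Beam.

```python
from typing import Optional, Sequence


def render_indexes(names: Sequence[Optional[str]]) -> str:
    """Hypothetical helper: render index names only, with no index data."""

    def show(name: Optional[str]) -> str:
        # An unnamed index level has name None; the placeholder text
        # used here is an assumption.
        return "<unnamed>" if name is None else repr(name)

    if len(names) == 1:
        # Common case: a single index reads as `index=...`.
        return "index=" + show(names[0])
    # Multiple levels: use the plural, list-style form.
    return "indexes=[" + ", ".join(show(n) for n in names) + "]"


single = render_indexes(["bazzy"])
multi = render_indexes(["bazzy", "barbar"])
unnamed = render_indexes([None])

print(single)   # index='bazzy'
print(multi)    # indexes=['bazzy', 'barbar']
print(unnamed)  # index=<unnamed>
```

The simpler alternative raised in the quoted question would drop the `len(names) == 1` branch and always emit `indexes=[...]`.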
    
   > Related question, why can't we use `repr(index)`? I wonder if there is 
more to an index than just its name? For example, if there are different types 
of index (which I gather there are) could that information be useful to the 
user?
   
   Yeah good question. A pandas index does have more information: a type, plus 
actual data. pandas's `repr(index)` shows all of this (name, type, and data), 
e.g.:
   
   ```
   In [4]: df.set_index('bazzy').index
   Out[4]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='bazzy')
   
   In [9]: df.set_index(['bazzy', 'barbar']).index
   Out[9]: 
   MultiIndex([(1, 'A'),
               (2, 'B'),
               (3, 'C'),
               (4, 'A'),
               (5, 'B'),
               (6, 'C')],
              names=['bazzy', 'barbar'])
   ```
   
   In the Beam case we don't have any data to show, so `repr(index)` should 
show just the name and type. However, I don't think it would make sense to 
reuse this in the repr for DataFrame and Series:
   1. DataFrame columns also have type information. If we include index types, 
we should include the column types as well, which gets pretty verbose 
(without a more formatted output).
   2. pandas doesn't show types for columns or indexes in its repr for 
DataFrame.
   
   Note users can always inspect other attributes to get type information, e.g. 
with `df.dtypes`.
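
For reference, a short sketch of pulling type information from attributes instead of the repr. It uses a plain pandas DataFrame with made-up data; the assumption is that the deferred frame is inspected the same way, as the `df.dtypes` mention above suggests.

```python
import pandas as pd

# Toy frame for illustration only.
df = pd.DataFrame({"bazzy": [1, 2, 3], "barbar": ["A", "B", "C"]})

# Column types come from `dtypes`, a Series keyed by column name.
column_types = {name: str(dtype) for name, dtype in df.dtypes.items()}

# The index carries its own dtype, separate from the columns.
index_type = str(df.set_index("bazzy").index.dtype)

print(column_types)
print(index_type)
```

Keeping types out of the repr leaves the summary short while the details stay one attribute access away.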
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 616555)
    Time Spent: 5h 40m  (was: 5.5h)

> DeferredSeries and DeferredDataFrame should have a useful repr
> --------------------------------------------------------------
>
>                 Key: BEAM-12533
>                 URL: https://issues.apache.org/jira/browse/BEAM-12533
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Assignee: Brian Hulette
>            Priority: P2
>             Fix For: 2.32.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> DeferredSeries and DeferredDataFrame just use the default __repr__ 
> implementation right now, which means outputting them in a notebook is not 
> useful at all. Users will need to inspect columns, dtypes, index, name, etc. 
> manually. We should include basic information about the frames in a simple 
> __repr__ implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
