[
https://issues.apache.org/jira/browse/BEAM-12533?focusedWorklogId=616555&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-616555
]
ASF GitHub Bot logged work on BEAM-12533:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 29/Jun/21 15:41
Start Date: 29/Jun/21 15:41
Worklog Time Spent: 10m
Work Description: TheNeuralBit commented on a change in pull request
#15089:
URL: https://github.com/apache/beam/pull/15089#discussion_r660743021
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -143,6 +143,14 @@ def wrapper(self, *args, **kwargs):
class DeferredDataFrameOrSeries(frame_base.DeferredFrame):
+ def _render_indexes(self):
Review comment:
> It may be better to keep things simple here and do `indexes={...}`
even for a single index. Though TBH I am not sure why `index.name` and
`index.names` are separate attributes (is this different between a pandas index
and a deferred index)?
This is true in pandas as well. If a DataFrame or Series has multiple
index levels, a separate type is used
([MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html)),
for which a single `name` isn't meaningful, since each level has its own name.
The `names` attribute always works, though: it's just a single-element list in
the non-MultiIndex case.
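A quick way to see the `names` behavior in plain pandas is the sketch below. The index values and names are borrowed from the example later in this comment; the variable names are illustrative only.

```python
import pandas as pd

# A regular (single-level) index with one name.
single = pd.Index([1, 2, 3], name="bazzy")

# A two-level index; pandas represents this as a MultiIndex.
multi = pd.MultiIndex.from_tuples(
    [(1, "A"), (2, "B"), (3, "C")], names=["bazzy", "barbar"])

# `names` is list-like in both cases, so it can be handled uniformly.
single_names = list(single.names)
multi_names = list(multi.names)

# `nlevels` distinguishes the single-level case from the multi-level one.
print(single.nlevels, single_names)
print(multi.nlevels, multi_names)
```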
I tend to agree it would be nice to keep this simpler. I'm hesitant though
since I think the single index case is much more common, so I want it to look
correct.
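To make the trade-off concrete, here is a minimal sketch of rendering a single index differently from multiple indexes. `render_indexes` and the `<unnamed>` placeholder are hypothetical names for illustration, not the code in the pull request; the function takes the list of names that `index.names` would provide.

```python
def render_indexes(names):
    """Hypothetical helper: format index names for a frame repr."""
    def fmt(name):
        # Assumption: unnamed index levels get a visible placeholder.
        return "<unnamed>" if name is None else repr(name)

    names = list(names)
    if len(names) == 1:
        # Common case: one index, rendered in the singular form.
        return "index=" + fmt(names[0])
    # Multiple levels: render all names in the plural form.
    return "indexes=[" + ", ".join(fmt(n) for n in names) + "]"


print(render_indexes(["bazzy"]))
print(render_indexes(["bazzy", "barbar"]))
```

Always using the plural form would drop the branch, at the cost of a slightly odd-looking repr in the single index case.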
> Related question, why can't we use `repr(index)`? I wonder if there is
more to an index than just its name? For example, if there are different types
of index (which I gather there are) could that information be useful to the
user?
Yeah, good question. A pandas index does have more information: a type, plus
the actual data. The pandas `repr(index)` shows all of this (name, type, and
data), e.g.:
```
In [4]: df.set_index('bazzy').index
Out[4]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='bazzy')
In [9]: df.set_index(['bazzy', 'barbar']).index
Out[9]:
MultiIndex([(1, 'A'),
            (2, 'B'),
            (3, 'C'),
            (4, 'A'),
            (5, 'B'),
            (6, 'C')],
           names=['bazzy', 'barbar'])
```
In the Beam case we don't have any data to show, so `repr(index)` should
show just the name and type. However, I don't think it would make sense to
reuse this in the repr for DataFrame and Series:
1. DataFrame columns also have type information. If we include index types,
we should include the column types as well, which gets pretty verbose
(without a more formatted output).
2. pandas doesn't show types for columns or indexes in its repr for
DataFrame.
Note users can always inspect other attributes to get type information, e.g.
with `df.dtypes`.
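As a small illustration in plain pandas (the frame contents are made up for this example), the DataFrame repr omits type information while `dtypes` reports it per column:

```python
import pandas as pd

df = pd.DataFrame({"bazzy": [1, 2, 3], "barbar": ["A", "B", "C"]})

# The DataFrame repr shows column names and data, but no dtypes.
frame_repr = repr(df)
print(frame_repr)

# Type information is available separately, one entry per column.
column_types = df.dtypes
print(column_types)
```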
Issue Time Tracking
-------------------
Worklog Id: (was: 616555)
Time Spent: 5h 40m (was: 5.5h)
> DeferredSeries and DeferredDataFrame should have a useful repr
> --------------------------------------------------------------
>
> Key: BEAM-12533
> URL: https://issues.apache.org/jira/browse/BEAM-12533
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: P2
> Fix For: 2.32.0
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> DeferredSeries and DeferredDataFrame just use the default __repr__
> implementation right now, which means outputting them in a notebook is not
> useful at all. Users will need to inspect columns, dtypes, index, name, etc.
> manually. We should include basic information about the frames in a simple
> __repr__ implementation.