[
https://issues.apache.org/jira/browse/BEAM-12169?focusedWorklogId=722799&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-722799
]
ASF GitHub Bot logged work on BEAM-12169:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Feb/22 12:52
Start Date: 08/Feb/22 12:52
Worklog Time Spent: 10m
Work Description: yeandy commented on a change in pull request #16615:
URL: https://github.com/apache/beam/pull/16615#discussion_r801597567
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4625,9 +4625,48 @@ def repeat(self, repeats):
       raise TypeError("str.repeat(repeats=) value must be an int or a "
                       f"DeferredSeries (encountered {type(repeats)}).")
-  get_dummies = frame_base.wont_implement_method(
-      pd.core.strings.StringMethods, 'get_dummies',
-      reason='non-deferred-columns')
+  @frame_base.with_docs_from(pd.core.strings.StringMethods)
+  @frame_base.args_to_kwargs(pd.core.strings.StringMethods)
+  def get_dummies(self, **kwargs):
+    """
+    Series must be categorical type. Either cast to ``category`` to
+    infer categories, or preferred, cast to ``CategoricalDtype``
+    to ensure correct categories.
Review comment:
Ack on 1.
And for 2, when I was referring to "cast to `category`", I meant that one
could feasibly be doing the following
```
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(['a', 'b', 'a', 'c', 'a', np.nan]).astype('category')
>>> s
0      a
1      b
2      a
3      c
4      a
5    NaN
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
```
which yields inferred categories. This is obviously fragile, but it
technically works, and the resulting `dtype` is a `CategoricalDtype`. If we
instead define the categories explicitly with `CategoricalDtype`, then we
may get:
```
>>> s = pd.Series(['a', 'b', 'a', 'c', 'a', np.nan],
...               dtype=pd.CategoricalDtype(categories=['b', 'a']))
>>> s
0      a
1      b
2      a
3    NaN
4      a
5    NaN
dtype: category
Categories (2, object): ['b', 'a']
>>> s.dtype
CategoricalDtype(categories=['b', 'a'], ordered=False)
```
Note that in the second example I didn't pass in the category `c`, so the
value in row 3 is not a known category and becomes `NaN`. The onus for
getting these categories correct would be on the user.
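As a rough illustration of why inferred categories can be fragile, here is a
minimal sketch in plain pandas (not Beam). It assumes the data arrives in
separate chunks; the names `part_1`, `part_2`, `abc` and `too_narrow` are made
up for the example.

```python
import pandas as pd

# Two hypothetical chunks of the same logical column.
part_1 = pd.Series(['a', 'b', 'a'])
part_2 = pd.Series(['c', 'a'])

# Inferred categories depend on the values each chunk happens to contain.
inferred_1 = part_1.astype('category')
inferred_2 = part_2.astype('category')
print(list(inferred_1.cat.categories))  # ['a', 'b']
print(list(inferred_2.cat.categories))  # ['a', 'c']

# An explicit CategoricalDtype fixes the categories up front.
abc = pd.CategoricalDtype(categories=['a', 'b', 'c'])
explicit_1 = part_1.astype(abc)
explicit_2 = part_2.astype(abc)
print(explicit_1.dtype == explicit_2.dtype)  # True

# Values missing from the declared categories silently become NaN.
too_narrow = pd.CategoricalDtype(categories=['b', 'a'])
print(part_2.astype(too_narrow).isna().tolist())  # [True, False]
```

With inferred categories each chunk ends up with a different dtype, while the
explicit `CategoricalDtype` keeps them consistent, at the cost of the user
having to list every category.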
I can definitely add a dedicated documentation section about this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 722799)
Time Spent: 2h 10m (was: 2h)
> Allow non-deferred column operations on categorical columns
> -----------------------------------------------------------
>
> Key: BEAM-12169
> URL: https://issues.apache.org/jira/browse/BEAM-12169
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Andy Ye
> Priority: P3
> Labels: dataframe-api
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> There are several operations that we currently disallow because they produce
> a variable set of columns in the output based on the data
> (non-deferred-columns). However, for some dtypes (categorical, boolean) we
> can easily enumerate all the possible values that will be seen at execution
> time, so we can predict the columns that will be seen.
> Note we still can't implement these operations 100% correctly, as pandas will
> typically only create columns for the values that are _observed_, while we'd
> have to create a column for every possible value.
> We should allow these operations in these special cases.
> Operations in this category:
> - DataFrame.unstack (can work if unstacked level is a categorical or boolean
> column)
> - Series.str.get_dummies
> - Series.str.split
> - Series.str.rsplit
> - DataFrame.pivot
> - DataFrame.pivot_table
> - len(GroupBy) and ngroups
> ** if groupers are all categorical _and_ observed=False or all boolean
> ** Note these two may not actually be equivalent in all cases:
> [https://github.com/pandas-dev/pandas/issues/26326]
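The gap between observed and possible values described in the issue above can
be sketched in plain pandas. This is only an illustration, not how Beam
implements these operations; `possible_values` is a hypothetical list that is
assumed to be known up front.

```python
import pandas as pd

# pandas only creates columns for the values actually observed in the data.
observed = pd.Series(['a', 'b', 'a']).str.get_dummies()
print(list(observed.columns))  # ['a', 'b']

# Hypothetical full set of values, assumed to be known before execution.
possible_values = ['a', 'b', 'c']

# Reindexing yields one column per possible value, filling unseen ones with 0.
fixed = observed.reindex(columns=possible_values, fill_value=0)
print(list(fixed.columns))  # ['a', 'b', 'c']
print(fixed['c'].tolist())  # [0, 0, 0]
```

The second result has a predictable set of columns regardless of the data,
but it differs from what pandas itself would return for the same input.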
--
This message was sent by Atlassian Jira
(v8.20.1#820001)