[
https://issues.apache.org/jira/browse/BEAM-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brian Hulette updated BEAM-12169:
---------------------------------
Description:
There are several operations that we currently disallow because they produce a
variable set of columns in the output based on the data (non-deferred-columns).
However, for some dtypes (categorical, boolean) we can easily enumerate all the
possible values that will be seen at execution time, so we can predict the
columns that will be seen.
We should allow these operations in these special cases.
Operations in this category:
- DataFrame.unstack (can work if unstacked level is a categorical or boolean
column)
- Series.str.get_dummies
- Series.str.split
- Series.str.rsplit
- DataFrame.pivot
- DataFrame.pivot_table
- len(GroupBy) (if groupers are either all categorical with observed=False,
or all boolean)
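A rough sketch of the idea in plain pandas (not the Beam DataFrame API): the
output columns of a pivot can be derived from the dtype alone when the pivoted
column is categorical or boolean. The helper name enumerable_values and the
sample data are hypothetical, chosen only for illustration.

```python
import pandas as pd


def enumerable_values(dtype):
    """Return every value a column of this dtype can hold, or None if unknown.

    Hypothetical helper: only categorical and boolean dtypes are enumerable.
    """
    if isinstance(dtype, pd.CategoricalDtype):
        return list(dtype.categories)
    if pd.api.types.is_bool_dtype(dtype):
        return [False, True]
    return None  # the output columns would depend on the data


color = pd.CategoricalDtype(["red", "green", "blue"])
df = pd.DataFrame({
    "id": [1, 1, 2],
    "color": pd.Series(["red", "blue", "red"], dtype=color),
    "value": [10, 20, 30],
})

# Columns are known from the dtype, before looking at any rows.
expected = enumerable_values(df["color"].dtype)

# Pivot on the categorical column, then align to the predicted columns so
# the result has the same shape regardless of which categories were observed.
wide = df.pivot(index="id", columns="color", values="value")
wide.columns = wide.columns.astype(object)  # plain labels instead of categorical
wide = wide.reindex(columns=expected)
```

Here "green" never occurs in the data, but it still appears as a column
(filled with NaN) because it is part of the dtype. For a non-enumerable dtype
such as int64 the helper returns None, which corresponds to the case that
would stay disallowed.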
was:
There are several operations that we currently disallow because they produce a
variable set of columns in the output based on the data (non-deferred-columns).
However, for some dtypes (categorical, boolean) we can easily enumerate all the
possible values that will be seen at execution time, so we can predict the
columns that will be seen.
We should allow these operations in these special cases.
Operations in this category:
- DataFrame.unstack (can work if unstacked level is a categorical column)
- Series.str.get_dummies
- Series.str.split
- Series.str.rsplit
- DataFrame.pivot
- DataFrame.pivot_table
- len(GroupBy) (if groupers are all categorical _and_ observed=False or all
boolean)
> DataFrame API: Allow non-deferred column operations on categorical columns
> --------------------------------------------------------------------------
>
> Key: BEAM-12169
> URL: https://issues.apache.org/jira/browse/BEAM-12169
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Brian Hulette
> Priority: P2
> Labels: dataframe-api
>
> There are several operations that we currently disallow because they produce
> a variable set of columns in the output based on the data
> (non-deferred-columns). However, for some dtypes (categorical, boolean) we
> can easily enumerate all the possible values that will be seen at execution
> time, so we can predict the columns that will be seen.
> We should allow these operations in these special cases.
> Operations in this category:
> - DataFrame.unstack (can work if unstacked level is a categorical or boolean
> column)
> - Series.str.get_dummies
> - Series.str.split
> - Series.str.rsplit
> - DataFrame.pivot
> - DataFrame.pivot_table
> - len(GroupBy) (if groupers are either all categorical with observed=False,
> or all boolean)