[jira] [Updated] (BEAM-12169) Allow non-deferred column operations on categorical columns

Brian Hulette (Jira) Fri, 25 Feb 2022 15:19:08 -0800


     [ https://issues.apache.org/jira/browse/BEAM-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Brian Hulette updated BEAM-12169:
---------------------------------
    Description: 
There are several operations that we currently disallow because they produce a 
variable set of columns in the output based on the data (non-deferred-columns). 
However, for some dtypes (categorical, boolean) we can easily enumerate all the 
possible values that will be seen at execution time, so we can predict the 
columns that will be seen.
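
To illustrate the idea, here is a minimal sketch in plain pandas (not Beam's 
implementation). `predict_columns` is a hypothetical helper introduced only for 
this example. It shows how candidate output columns could be enumerated from a 
dtype alone, whereas Series.str.get_dummies on plain strings yields columns 
that depend on the data.

```python
import pandas as pd


def predict_columns(dtype):
    """Hypothetical helper: list every value a column of this dtype can hold."""
    if isinstance(dtype, pd.CategoricalDtype):
        return list(dtype.categories)
    if pd.api.types.is_bool_dtype(dtype):
        return [False, True]
    raise TypeError(f"cannot enumerate values for dtype {dtype!r}")


# With plain strings, the output columns depend on the data itself.
cols_1 = list(pd.Series(["a|b", "a"]).str.get_dummies().columns)
cols_2 = list(pd.Series(["a|c", "c"]).str.get_dummies().columns)

# With a categorical or boolean dtype, the candidates are known up front.
cat_cols = predict_columns(pd.CategoricalDtype(["a", "b", "c"]))
bool_cols = predict_columns(pd.Series([True, False]).dtype)
```

The two string Series produce different column sets, which is why the 
operation cannot be deferred in general. The dtype-based enumeration needs no 
data at all.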

Note we still can't implement these operations 100% correctly, as pandas will 
typically only create columns for the values that are _observed_, while we'd 
have to create a column for every possible value.
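
The discrepancy can be sketched in plain pandas as follows. The data and the 
`categories` list are made up for illustration, and the reindex step only 
stands in for what a deferred implementation would have to emit. It is an 
assumption, not Beam's actual approach.

```python
import pandas as pd

categories = ["a", "b", "c"]  # every possible value of the unstacked level
index = pd.MultiIndex.from_tuples(
    [(0, "a"), (0, "b"), (1, "a")], names=["row", "key"]
)
s = pd.Series([1, 2, 3], index=index)

# pandas only creates columns for the values present in the data.
observed = s.unstack("key")

# A deferred version would need a column per possible value, observed or not.
full = observed.reindex(columns=categories)
```

Here `observed` has columns "a" and "b" only, while `full` also carries an 
all-NaN column "c" that pandas itself would not have produced.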

We should allow these operations in these special cases.

Operations in this category:
 - DataFrame.unstack, Series.unstack (can work if unstacked level is a 
categorical or boolean column)
 - Series.str.get_dummies
 - Series.str.split
 - Series.str.rsplit
 - DataFrame.pivot
 - DataFrame.pivot_table
 - len(GroupBy) and ngroups
 ** if the groupers are either all categorical (with observed=False) or all 
boolean
 ** Note len(GroupBy) and ngroups may not actually be equivalent in all cases: 
[https://github.com/pandas-dev/pandas/issues/26326]
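
For the GroupBy case, a small sketch in plain pandas with made-up data shows 
how the group count could be predicted from the dtype. It uses the length of 
size() as a stand-in for the number of groups, since len(GroupBy) and ngroups 
themselves may disagree as noted in the linked issue.

```python
import pandas as pd

df = pd.DataFrame({
    "k": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "v": [1, 2, 3],
})

# Group count known from the dtype alone, before any data is seen.
predicted = len(df["k"].dtype.categories)

# observed=False keeps a slot for the unobserved category "c".
all_groups = df.groupby("k", observed=False).size()

# observed=True keeps only the groups that occur in the data.
seen_groups = df.groupby("k", observed=True).size()
```

With observed=False the result has one entry per category, matching 
`predicted`. With observed=True the count depends on the data.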

  was:
There are several operations that we currently disallow because they produce a 
variable set of columns in the output based on the data (non-deferred-columns). 
However, for some dtypes (categorical, boolean) we can easily enumerate all the 
possible values that will be seen at execution time, so we can predict the 
columns that will be seen.

Note we still can't implement these operations 100% correctly, as pandas will 
typically only create columns for the values that are _observed_, while we'd 
have to create a column for every possible value.

We should allow these operations in these special cases.

Operations in this category:
 - DataFrame.unstack (can work if unstacked level is a categorical or boolean 
column)
 - Series.str.get_dummies
 - Series.str.split
 - Series.str.rsplit
 - DataFrame.pivot
 - DataFrame.pivot_table
 - len(GroupBy) and ngroups
 ** if the groupers are either all categorical (with observed=False) or all 
boolean
 ** Note len(GroupBy) and ngroups may not actually be equivalent in all cases: 
[https://github.com/pandas-dev/pandas/issues/26326]


> Allow non-deferred column operations on categorical columns
> -----------------------------------------------------------
>
>                 Key: BEAM-12169
>                 URL: https://issues.apache.org/jira/browse/BEAM-12169
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Andy Ye
>            Priority: P3
>              Labels: dataframe-api
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> There are several operations that we currently disallow because they produce 
> a variable set of columns in the output based on the data 
> (non-deferred-columns). However, for some dtypes (categorical, boolean) we 
> can easily enumerate all the possible values that will be seen at execution 
> time, so we can predict the columns that will be seen.
> Note we still can't implement these operations 100% correctly, as pandas will 
> typically only create columns for the values that are _observed_, while we'd 
> have to create a column for every possible value.
> We should allow these operations in these special cases.
> Operations in this category:
>  - DataFrame.unstack, Series.unstack (can work if unstacked level is a 
> categorical or boolean column)
>  - Series.str.get_dummies
>  - Series.str.split
>  - Series.str.rsplit
>  - DataFrame.pivot
>  - DataFrame.pivot_table
>  - len(GroupBy) and ngroups
>  ** if the groupers are either all categorical (with observed=False) or all 
> boolean
>  ** Note len(GroupBy) and ngroups may not actually be equivalent in all 
> cases: [https://github.com/pandas-dev/pandas/issues/26326]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
