[GitHub] [beam] damccorm opened a new issue, #20958: Allow non-deferred column operations on categorical columns

GitBox Sat, 04 Jun 2022 13:44:04 -0700


damccorm opened a new issue, #20958:
URL: https://github.com/apache/beam/issues/20958


   Several operations are currently disallowed because the set of columns in 
their output depends on the data (non-deferred columns). However, for some 
dtypes (categorical, boolean) we can enumerate in advance every value that can 
appear at execution time, so we can predict the columns that the output will 
contain.
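   
   A minimal sketch of why these dtypes are special, using plain pandas. The 
helper name `possible_values` is hypothetical and is not part of Beam or 
pandas; it only illustrates that the full value set can be read from the dtype 
without looking at any data.

```python
import pandas as pd

def possible_values(dtype):
    """Return every value the dtype allows, or None if that depends on the data.

    Hypothetical helper for illustration only.
    """
    if isinstance(dtype, pd.CategoricalDtype):
        # A categorical dtype carries its complete list of categories.
        return list(dtype.categories)
    if pd.api.types.is_bool_dtype(dtype):
        # A boolean column can only ever hold these two values.
        return [False, True]
    # For other dtypes (object, int, float, ...) the values are data-dependent.
    return None

color = pd.CategoricalDtype(["red", "green", "blue"])
bool_dtype = pd.Series([True]).dtype
float_dtype = pd.Series([1.5]).dtype

print(possible_values(color))        # ['red', 'green', 'blue']
print(possible_values(bool_dtype))   # [False, True]
print(possible_values(float_dtype))  # None
```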
   
   Note that we still can't match pandas exactly: pandas typically creates 
columns only for the values that are __observed__, whereas we would have to 
create a column for every possible value.
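   
   The difference can be illustrated with plain pandas. This is only a sketch: 
the list `all_values` is a hypothetical stand-in for the value set known from 
the dtype, and the `reindex` step is one possible way to pad the output, not 
necessarily how Beam would implement it.

```python
import pandas as pd

# Hypothetical: the full set of values, known up front from the dtype.
all_values = ["a", "b", "c"]

# In this particular data, "c" never occurs.
s = pd.Series(["a", "b", "a"])

# pandas creates one column per value it actually observes.
observed = s.str.get_dummies()
print(list(observed.columns))  # ['a', 'b']

# A data-independent schema needs a column for every possible value,
# so the unobserved value is added as an all-zero column.
full = observed.reindex(columns=all_values, fill_value=0)
print(list(full.columns))  # ['a', 'b', 'c']
```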
   
   We should allow these operations in these special cases.
   
   Operations in this category:
    - DataFrame.unstack, Series.unstack (can work if the unstacked level is a 
categorical or boolean column)
    - Series.str.get_dummies
    - Series.str.split
    - Series.str.rsplit
    - DataFrame.pivot
    - DataFrame.pivot_table
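   
   As a rough illustration of the `unstack` case with a boolean level, the 
sketch below uses plain pandas on two hypothetical partitions of the same 
dataset. The helper `unstack_flag` and the level name `flag` are assumptions 
made for this example. Without a fixed column set, each partition yields 
different columns; with it, both yield the same schema.

```python
import pandas as pd

def unstack_flag(s):
    """Unstack the boolean 'flag' level, always emitting both columns.

    Hypothetical helper: missing combinations are left as NaN.
    """
    return s.unstack("flag").reindex(columns=[False, True])

# Two partitions of one logical Series; each sees only one flag value.
part1 = pd.Series(
    [10, 20],
    index=pd.MultiIndex.from_tuples(
        [("r1", True), ("r2", True)], names=["row", "flag"]),
)
part2 = pd.Series(
    [30],
    index=pd.MultiIndex.from_tuples([("r3", False)], names=["row", "flag"]),
)

# Plain pandas: the output columns depend on the data in each partition.
print(list(part1.unstack("flag").columns))  # [True]
print(list(part2.unstack("flag").columns))  # [False]

# With the value set fixed up front, both partitions agree on the columns.
wide1 = unstack_flag(part1)
wide2 = unstack_flag(part2)
print(list(wide1.columns), list(wide2.columns))
```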
   
   Imported from Jira 
[BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169). Original Jira 
may contain additional context.
   Reported by: bhulette.


