yeandy commented on a change in pull request #16615:
URL: https://github.com/apache/beam/pull/16615#discussion_r801597567



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4625,9 +4625,48 @@ def repeat(self, repeats):
       raise TypeError("str.repeat(repeats=) value must be an int or a "
                       f"DeferredSeries (encountered {type(repeats)}).")
 
-  get_dummies = frame_base.wont_implement_method(
-      pd.core.strings.StringMethods, 'get_dummies',
-      reason='non-deferred-columns')
+  @frame_base.with_docs_from(pd.core.strings.StringMethods)
+  @frame_base.args_to_kwargs(pd.core.strings.StringMethods)
+  def get_dummies(self, **kwargs):
+    """
+    Series must be categorical type. Either cast to ``category`` to
+    infer categories, or preferred, cast to ``CategoricalDtype``
+    to ensure correct categories.

Review comment:
       Ack on 1. 
   
   And for 2, when I said "cast to `category`", I meant that one could 
feasibly do the following:
   ```
   >>> import numpy as np
   >>> import pandas as pd
   >>> s = pd.Series(['a', 'b', 'a', 'c', 'a', np.nan]).astype('category')
   >>> s
   0      a
   1      b
   2      a
   3      c
   4      a
   5    NaN
   dtype: category
   Categories (3, object): ['a', 'b', 'c']
   >>> s.dtype
   CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
   ```
   which infers the categories from the data. This is obviously fragile, 
because the categories depend on whichever values happen to be present, but 
it technically works, and the resulting `dtype` is a `CategoricalDtype`. If we 
instead define the `CategoricalDtype` explicitly, we may get:
   ```
   >>> s = pd.Series(['a', 'b', 'a', 'c', 'a', np.nan], dtype=pd.CategoricalDtype(categories=['b', 'a']))
   >>> s
   0      a
   1      b
   2      a
   3    NaN
   4      a
   5    NaN
   dtype: category
   Categories (2, object): ['b', 'a']
   >>> s.dtype
   CategoricalDtype(categories=['b', 'a'], ordered=False)
   ```
   Note that in the second example I didn't pass in the category `c`, so the 
value `c` in row 3 is treated as unknown and becomes `NaN`. The onus for 
getting these categories right would be on the user.
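
   To make the fragility concrete, here is a minimal pandas-only sketch. It 
assumes the data is split into two hypothetical partitions (`part1` and 
`part2` are made-up names), and it uses plain `pd.get_dummies` as a stand-in 
rather than the deferred `str.get_dummies` from this PR, so treat it as an 
illustration of the idea and not of the actual implementation:

```python
import numpy as np
import pandas as pd

values = ['a', 'b', 'a', 'c', 'a', np.nan]

# Two hypothetical partitions of the same data (an assumption for this sketch).
part1 = pd.Series(values[:3])  # 'a', 'b', 'a'
part2 = pd.Series(values[3:])  # 'c', 'a', NaN

# Inferring categories separately on each partition gives different columns.
inferred_cols1 = list(pd.get_dummies(part1.astype('category')).columns)
inferred_cols2 = list(pd.get_dummies(part2.astype('category')).columns)

# Declaring the categories up front gives every partition the same columns,
# including all-zero columns for categories a partition never observes.
dtype = pd.CategoricalDtype(categories=['a', 'b', 'c'])
explicit_cols1 = list(pd.get_dummies(part1.astype(dtype)).columns)
explicit_cols2 = list(pd.get_dummies(part2.astype(dtype)).columns)

print(inferred_cols1, inferred_cols2)
print(explicit_cols1, explicit_cols2)
```

   With inferred categories the two partitions disagree on the output columns, 
whereas with an explicit `CategoricalDtype` they agree, which is why I'd call 
the explicit form the preferred one.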
   
   I can definitely add a dedicated documentation section about this.
   



