TheNeuralBit commented on a change in pull request #16590:
URL: https://github.com/apache/beam/pull/16590#discussion_r796186286



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4120,6 +4127,23 @@ def dtypes(self):
     grouping_columns = self._grouping_columns
     return self.apply(lambda df: df.drop(grouping_columns, axis=1).dtypes)
 
+  if hasattr(DataFrameGroupBy, 'value_counts'):
+    @frame_base.with_docs_from(DataFrameGroupBy)
+    def value_counts(self, subset=None, sort=False, normalize=False,
+                      ascending=False, dropna=True):

Review comment:
       A couple of things here:
   - Since we don't actually need to observe the values of any of the args, 
you could just define the function as `def value_counts(self, **kwargs):` and 
pass the keyword args through. That way we won't need to duplicate pandas' 
default values.
   - We should probably add a docstring noting that this is basically the same 
as `DataFrame.value_counts()`. In normal pandas the only difference between 
the two seems to be the order of the result, but we make no ordering 
guarantees anyway. Note that `with_docs_from` copies the docstring from pandas, 
and any docstring we define here is added to a "Differences from pandas" 
section, see for example 
https://beam.apache.org/releases/pydoc/2.35.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredSeries.append
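The passthrough could look roughly like this — a minimal sketch using plain pandas for illustration; the Beam `DeferredFrame`/`ComputedExpression` wiring and the `hasattr` guard are omitted, and `grouped_value_counts` is a hypothetical name:

```python
import pandas as pd

def grouped_value_counts(frame, **kwargs):
  """Same semantics as DataFrame.value_counts(); no defaults re-declared."""
  # Forward everything to pandas so its own default values apply.
  return frame.value_counts(**kwargs)

df = pd.DataFrame({'animal': ['cat', 'cat', 'dog']})
counts = grouped_value_counts(df, normalize=True)
```

Because nothing inspects the arguments, pandas remains the single source of truth for defaults, and the wrapper keeps working if pandas adds or changes keyword arguments.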

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4120,6 +4127,23 @@ def dtypes(self):
     grouping_columns = self._grouping_columns
     return self.apply(lambda df: df.drop(grouping_columns, axis=1).dtypes)
 
+  if hasattr(DataFrameGroupBy, 'value_counts'):
+    @frame_base.with_docs_from(DataFrameGroupBy)
+    def value_counts(self, subset=None, sort=False, normalize=False,
+                      ascending=False, dropna=True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'value_counts',
+              lambda df: df.value_counts(
+                subset=subset,
+                sort=sort,
+                normalize=normalize,
+                ascending=ascending,
+                dropna=dropna), [self._expr],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Arbitrary())
+      )

Review comment:
       It might be better to call `value_counts` on the original frame object, 
since that could partition the dataset more finely in some cases (e.g. if we're 
counting by more columns than the ones specified in the groupby). But it looks 
like the GroupBy object doesn't have access to the original frame, so we'd have 
to plumb that through.
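The extra partitioning opportunity is visible in plain pandas: a grouped count is effectively a `size()` over the union of the grouping and counted columns. A sketch (using `groupby(...).size()` here, since `GroupBy.value_counts` itself only exists in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 1, 2]})

# Counting is keyed by ('g', 'v'), not just the groupby key 'g', so a
# deferred implementation with access to the original frame could
# partition the data on both columns rather than on 'g' alone.
combined = df.groupby(['g', 'v']).size()
```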

##########
File path: sdks/python/apache_beam/dataframe/pandas_doctests_test.py
##########
@@ -121,11 +122,12 @@ def test_ndframe_tests(self):
             'pandas.core.generic.NDFrame.convert_dtypes': ['*'],
             'pandas.core.generic.NDFrame.copy': ['*'],
             'pandas.core.generic.NDFrame.droplevel': ['*'],
+            'pandas.core.generic.NDFrame.get': ['*'],
             'pandas.core.generic.NDFrame.rank': [
                 # Modified dataframe
                 'df'
             ],
-            'pandas.core.generic.NDFrame.rename': [
+            'pandas.core.generic.NDFrame._rename': [

Review comment:
       hehe




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

