[GitHub] [beam] yeandy commented on a change in pull request #16590: [BEAM-13605] Update pandas_doctests_test denylists in preparation for pandas 1.4.0

GitBox Tue, 01 Feb 2022 15:54:21 -0800


yeandy commented on a change in pull request #16590:
URL: https://github.com/apache/beam/pull/16590#discussion_r797161422




##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4120,6 +4127,23 @@ def dtypes(self):
     grouping_columns = self._grouping_columns
     return self.apply(lambda df: df.drop(grouping_columns, axis=1).dtypes)
 
+  if hasattr(DataFrameGroupBy, 'value_counts'):
+    @frame_base.with_docs_from(DataFrameGroupBy)
+    def value_counts(self, subset=None, sort=False, normalize=False,
+                      ascending=False, dropna=True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'value_counts',
+              lambda df: df.value_counts(
+                subset=subset,
+                sort=sort,
+                normalize=normalize,
+                ascending=ascending,
+                dropna=dropna), [self._expr],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Arbitrary())
+      )

Review comment:
       This turns out to be more complicated than passing the original 
parent frame and calling something like `self._parent.value_counts(**kwargs)`. 
   
   # Example
   ## When `normalize=False`
   `df.value_counts()` results in
   ```
   gender  education  country
   male    low        FR         2
   female  high       FR         1
                      US         1
   male    low        US         1
           medium     FR         1
   
   ```
   and `df.groupby('gender').value_counts()` results in
   ```
   gender  education  country
   female  high       FR         1
                      US         1
   male    low        FR         2
                      US         1
           medium     FR         1
   ```
   The outcomes are equivalent: the same counts for the same rows, differing 
only in row order.
   
   ## When `normalize=True`
   `df.value_counts(normalize=True)` results in
   ```
   # normalization occurs across all the data
   gender  education  country
   male    low        FR         0.333333
   female  high       FR         0.166667
                      US         0.166667
   male    low        US         0.166667
           medium     FR         0.166667
   ``` 
   and `df.groupby('gender').value_counts(normalize=True)` results in 
   ```
   # normalization occurs within the groups
   gender  education  country
   female  high       FR         0.50
                      US         0.50
   male    low        FR         0.50
                      US         0.25
           medium     FR         0.25
   ```
   In the former, normalization occurs across all the data, but in the latter 
the normalization occurs within the groups.
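
The arithmetic behind the two normalizations can be sketched in plain Python. This is only an illustration, not how pandas or Beam computes the result. The `rows` list is an assumption: it is a hypothetical set of six rows chosen to reproduce the counts shown above, and the names `global_norm` and `group_norm` are made up for this sketch.

```python
from collections import Counter

# Hypothetical rows (gender, education, country) chosen to match the
# counts in the example tables; the original data is not shown here.
rows = [
    ("male", "low", "US"),
    ("male", "medium", "FR"),
    ("female", "high", "US"),
    ("male", "low", "FR"),
    ("female", "high", "FR"),
    ("male", "low", "FR"),
]

# Count each distinct (gender, education, country) combination.
counts = Counter(rows)

# Normalization across all the data: divide by the total row count.
total = sum(counts.values())
global_norm = {key: n / total for key, n in counts.items()}

# Normalization within groups: divide by the size of the row's gender group.
group_sizes = Counter(gender for gender, _, _ in rows)
group_norm = {key: n / group_sizes[key[0]] for key, n in counts.items()}

for key in sorted(counts):
    print(key, counts[key], round(global_norm[key], 6), group_norm[key])
```

The counts are identical in both cases; only the denominator changes, which is why a plain `DataFrame.value_counts(normalize=True)` on the parent frame cannot stand in for the grouped version.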
   <br>
   
   # Conclusion
   I looked at the pandas 
[implementation](https://github.com/pandas-dev/pandas/blob/v1.4.0/pandas/core/groupby/generic.py#L1575-L1760)
 of `DataFrameGroupBy.value_counts()`, and it does not call 
`DataFrame.value_counts()`; the logic is fairly complex. I'm not sure whether 
it's worth replicating the equivalent logic using the newly created 
`self._parent` attribute, or whether we should call it a day with the current 
solution. If we keep the current solution, I can still add the `self._parent` 
attribute for future operations if necessary.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
