yeandy commented on a change in pull request #16590:
URL: https://github.com/apache/beam/pull/16590#discussion_r797161422
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4120,6 +4127,23 @@ def dtypes(self):
grouping_columns = self._grouping_columns
return self.apply(lambda df: df.drop(grouping_columns, axis=1).dtypes)
+ if hasattr(DataFrameGroupBy, 'value_counts'):
+ @frame_base.with_docs_from(DataFrameGroupBy)
+ def value_counts(self, subset=None, sort=False, normalize=False,
+ ascending=False, dropna=True):
+ return frame_base.DeferredFrame.wrap(
+ expressions.ComputedExpression(
+ 'value_counts',
+ lambda df: df.value_counts(
+ subset=subset,
+ sort=sort,
+ normalize=normalize,
+ ascending=ascending,
+ dropna=dropna), [self._expr],
+ preserves_partition_by=partitionings.Arbitrary(),
+ requires_partition_by=partitionings.Arbitrary())
+ )
Review comment:
It looks like this is a bit more complicated than just passing the original
parent frame and calling something like `self._parent.value_counts(**kwargs)`.
# Example
## When `normalize=False`
`df.value_counts()` results in
```
gender education country
male low FR 2
female high FR 1
US 1
male low US 1
medium FR 1
```
and `df.groupby('gender').value_counts()` results in
```
gender education country
female high FR 1
US 1
male low FR 2
US 1
medium FR 1
```
The outcomes are "equivalent": the counts are identical, and only the row order differs.
## When `normalize=True`
`df.value_counts(normalize=True)` results in
```
# normalization occurs across all the data
gender education country
male low FR 0.333333
female high FR 0.166667
US 0.166667
male low US 0.166667
medium FR 0.166667
```
and `df.groupby('gender').value_counts(normalize=True)` results in
```
# normalization occurs within the groups
gender education country
female high FR 0.50
US 0.50
male low FR 0.50
US 0.25
medium FR 0.25
```
In the former, normalization occurs across all the data, but in the latter
the normalization occurs within the groups.
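
To make the difference concrete, here is a minimal plain-Python sketch of the two normalizations. It is only an illustration, not the pandas or Beam implementation; the `rows` list is reconstructed from the counts shown above, and the names `global_freq` and `grouped_freq` are made up for this example.

```python
from collections import Counter

# Rows reconstructed from the counts shown above: (gender, education, country).
rows = [
    ("male", "low", "US"),
    ("male", "medium", "FR"),
    ("female", "high", "US"),
    ("male", "low", "FR"),
    ("female", "high", "FR"),
    ("male", "low", "FR"),
]

# Raw counts per distinct row; identical for both variants.
counts = Counter(rows)

# Normalization across all the data: divide each count by the total row count.
total = sum(counts.values())
global_freq = {key: n / total for key, n in counts.items()}

# Normalization within groups: divide each count by the size of its gender group.
group_sizes = Counter(gender for gender, _, _ in rows)
grouped_freq = {key: n / group_sizes[key[0]] for key, n in counts.items()}

for key in sorted(counts):
    print(key, counts[key], round(global_freq[key], 6), grouped_freq[key])
```

The denominators are the only thing that changes (6 for the whole frame versus 4 for `male` and 2 for `female`), which is why a single call on the parent frame cannot reproduce the grouped result when `normalize=True`.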
<br>
# Conclusion
I looked at the pandas
[implementation](https://github.com/pandas-dev/pandas/blob/v1.4.0/pandas/core/groupby/generic.py#L1575-L1760)
of `DataFrameGroupBy.value_counts()`, and it does not call
`DataFrame.value_counts()`. It's a bit complex. I'm not sure whether it's worth
replicating the equivalent logic using the newly created `self._parent`
attribute, or whether we should just call it a day with the current solution.
If we can use the current solution, I can still add the `self._parent`
attribute for future operations if necessary.