[
https://issues.apache.org/jira/browse/BEAM-13605?focusedWorklogId=719091&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-719091
]
ASF GitHub Bot logged work on BEAM-13605:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 01/Feb/22 23:54
Start Date: 01/Feb/22 23:54
Worklog Time Spent: 10m
Work Description: yeandy commented on a change in pull request #16590:
URL: https://github.com/apache/beam/pull/16590#discussion_r797161422
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4120,6 +4127,23 @@ def dtypes(self):
     grouping_columns = self._grouping_columns
     return self.apply(lambda df: df.drop(grouping_columns, axis=1).dtypes)
 
+  if hasattr(DataFrameGroupBy, 'value_counts'):
+    @frame_base.with_docs_from(DataFrameGroupBy)
+    def value_counts(self, subset=None, sort=False, normalize=False,
+                      ascending=False, dropna=True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'value_counts',
+              lambda df: df.value_counts(
+                subset=subset,
+                sort=sort,
+                normalize=normalize,
+                ascending=ascending,
+                dropna=True), [self._expr],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Arbitrary())
+      )
Review comment:
It looks like this is a bit more complicated than just passing the original
parent frame and calling something like `self._parent.value_counts(**kwargs)`.
# Example
## When `normalize=False`
`df.value_counts()` results in
```
gender  education  country
male    low        FR         2
female  high       FR         1
                   US         1
male    low        US         1
        medium     FR         1
```
and `df.groupby('gender').value_counts()` results in
```
gender  education  country
female  high       FR         1
                   US         1
male    low        FR         2
                   US         1
        medium     FR         1
```
The outcomes are "equivalent": the counts are identical, and only the row order differs.
## When `normalize=True`
`df.value_counts(normalize=True)` results in
```
# normalization occurs across all the data
gender  education  country
male    low        FR         0.333333
female  high       FR         0.166667
                   US         0.166667
male    low        US         0.166667
        medium     FR         0.166667
```
and `df.groupby('gender').value_counts(normalize=True)` results in
```
# normalization occurs within the groups
gender  education  country
female  high       FR         0.50
                   US         0.50
male    low        FR         0.50
                   US         0.25
        medium     FR         0.25
```
In the former, normalization occurs across all the data, but in the latter
the normalization occurs within the groups.
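
To make the difference concrete, here is a minimal sketch (not the Beam or pandas implementation) showing how the within-group proportions relate to the whole-frame counts. The input frame is an assumption, reconstructed so that it reproduces the counts shown above. The variable names are illustrative only, and the sketch ignores `subset`, `sort`, `ascending`, and `dropna` handling.

```python
import pandas as pd

# Assumed input: a frame consistent with the counts shown above.
df = pd.DataFrame({
    "gender": ["male", "male", "female", "male", "female", "male"],
    "education": ["low", "medium", "high", "low", "high", "low"],
    "country": ["US", "FR", "US", "FR", "FR", "FR"],
})

# Counts over the whole frame, as a parent-frame call would return them.
global_counts = df.value_counts()

# Global normalization: divide every count by the total number of rows.
global_share = global_counts / len(df)

# Within-group normalization: divide every count by the total of its
# `gender` group instead of by the grand total.
group_totals = global_counts.groupby(level="gender").transform("sum")
per_group_share = global_counts / group_totals

print(per_group_share.sort_index())

# Cross-check against pandas where the method exists (pandas >= 1.4).
if hasattr(pd.core.groupby.DataFrameGroupBy, "value_counts"):
    expected = df.groupby("gender").value_counts(normalize=True)
    print(expected.to_dict() == per_group_share.to_dict())
```

So a plain `self._parent.value_counts(normalize=True)` would divide by the wrong denominator; some per-group rescaling step like the one sketched here would be needed on top of it.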
# Conclusion
I looked at the pandas
[implementation](https://github.com/pandas-dev/pandas/blob/v1.4.0/pandas/core/groupby/generic.py#L1575-L1760)
of `DataFrameGroupBy.value_counts()`, and it does not call
`DataFrame.value_counts()`. It's a bit complex. I'm not sure whether it's worth
replicating the equivalent logic using the newly created `self._parent`
attribute or just calling it a day with the current solution. If we keep the
current solution, I can still add the `self._parent` attribute for future
operations if necessary.
Issue Time Tracking
-------------------
Worklog Id: (was: 719091)
Time Spent: 3h 50m (was: 3h 40m)
> Support pandas 1.4.0 in the DataFrame API
> -----------------------------------------
>
> Key: BEAM-13605
> URL: https://issues.apache.org/jira/browse/BEAM-13605
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe
> Reporter: Brian Hulette
> Assignee: Andy Ye
> Priority: P2
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> 1.4.0rc1 is out now. We should verify that it works with the DataFrame API, then
> increase the version range to allow it.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)