[jira] [Work logged] (BEAM-13605) Support pandas 1.4.0 in the DataFrame API

ASF GitHub Bot (Jira) Tue, 01 Feb 2022 15:55:04 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-13605?focusedWorklogId=719091&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-719091
 ]


ASF GitHub Bot logged work on BEAM-13605:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Feb/22 23:54
            Start Date: 01/Feb/22 23:54
    Worklog Time Spent: 10m 
      Work Description: yeandy commented on a change in pull request #16590:
URL: https://github.com/apache/beam/pull/16590#discussion_r797161422



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -4120,6 +4127,23 @@ def dtypes(self):
     grouping_columns = self._grouping_columns
     return self.apply(lambda df: df.drop(grouping_columns, axis=1).dtypes)
 
+  if hasattr(DataFrameGroupBy, 'value_counts'):
+    @frame_base.with_docs_from(DataFrameGroupBy)
+    def value_counts(self, subset=None, sort=False, normalize=False,
+                      ascending=False, dropna=True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'value_counts',
+              lambda df: df.value_counts(
+                subset=subset,
+                sort=sort,
+                normalize=normalize,
+                ascending=ascending,
+                dropna=dropna), [self._expr],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Arbitrary())
+      )

Review comment:
       This turns out to be a bit more complicated than just keeping the original 
parent frame and calling something like `self._parent.value_counts(**kwargs)`. 
   
   # Example
   ## When `normalize=False`
   `df.value_counts()` results in
   ```
   gender  education  country
   male    low        FR         2
   female  high       FR         1
                      US         1
   male    low        US         1
           medium     FR         1
   
   ```
   and `df.groupby('gender').value_counts()` results in
   ```
   gender  education  country
   female  high       FR         1
                      US         1
   male    low        FR         2
                      US         1
           medium     FR         1
   ```
   The outcomes are equivalent: both contain the same counts and differ only in row order.
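
A minimal sketch to check that equivalence, assuming pandas >= 1.4 (where `DataFrameGroupBy.value_counts` exists). The six-row frame is an assumption: it was chosen to reproduce the counts shown above and is not taken from the PR.

```python
import pandas as pd

# Assumed sample data, chosen to reproduce the counts shown above.
df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'FR'],
})

plain = df.value_counts()
grouped = df.groupby('gender').value_counts()

# Comparing as dicts ignores row order, which is the only difference
# between the two results when normalize=False.
same_counts = plain.to_dict() == grouped.to_dict()
print(same_counts)
```

The dict comparison is only a convenience for this small example; it keys each count by its `(gender, education, country)` tuple.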
   
   ## When `normalize=True`
   `df.value_counts(normalize=True)` results in
   ```
   # normalization occurs across all the data
   gender  education  country
   male    low        FR         0.333333
   female  high       FR         0.166667
                      US         0.166667
   male    low        US         0.166667
           medium     FR         0.166667
   ``` 
   and `df.groupby('gender').value_counts(normalize=True)` results in 
   ```
   # normalization occurs within the groups
   gender  education  country
   female  high       FR         0.50
                      US         0.50
   male    low        FR         0.50
                      US         0.25
           medium     FR         0.25
   ```
   In the former, normalization occurs across all the data, but in the latter 
the normalization occurs within the groups.
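
A short sketch of the same difference, again assuming pandas >= 1.4 and an assumed six-row frame that matches the outputs above. It checks what each set of proportions sums to: 1 overall in the first case, 1 per `gender` group in the second.

```python
import pandas as pd

# Assumed sample data, chosen to reproduce the proportions shown above.
df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'FR'],
})

overall = df.value_counts(normalize=True)
per_group = df.groupby('gender').value_counts(normalize=True)

# Proportions over the whole frame sum to 1 once.
overall_total = overall.sum()

# Proportions within groups sum to 1 for each gender separately.
group_totals = per_group.groupby(level='gender').sum()

print(overall_total)
print(group_totals)
```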
   
   # Conclusion
   I looked at the pandas 
[implementation](https://github.com/pandas-dev/pandas/blob/v1.4.0/pandas/core/groupby/generic.py#L1575-L1760)
 of `DataFrameGroupBy.value_counts()`, and it does not call 
`DataFrame.value_counts()`; the logic is fairly complex. I'm not sure whether 
it's worth replicating the equivalent logic using the newly created 
`self._parent` attribute, or whether we should just call it a day with the 
current solution. If we keep the current solution, I can still add the 
`self._parent` attribute for future operations if necessary.
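
For a sense of what replicating the normalization would involve, here is a rough sketch in plain pandas. It assumes pandas >= 1.4 and the same assumed six-row frame. It is not how pandas implements `DataFrameGroupBy.value_counts()`, and it ignores `subset`, `sort`, `ascending`, `dropna`, and anything Beam-specific such as partitioning. It only shows that, for this example, dividing the plain counts by their per-group totals gives the grouped proportions.

```python
import pandas as pd

# Assumed sample data, chosen to reproduce the outputs shown above.
df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'FR'],
})

# Plain counts over the whole (parent) frame.
counts = df.value_counts()

# Total count per gender, broadcast back onto every row of `counts`.
group_sizes = counts.groupby(level='gender').transform('sum')

# Proportion of each combination within its gender group.
derived = (counts / group_sizes).to_dict()

# Reference result from pandas itself, for comparison.
expected = df.groupby('gender').value_counts(normalize=True).to_dict()

matches = all(abs(derived[key] - expected[key]) < 1e-9 for key in expected)
print(matches)
```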




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 719091)
    Time Spent: 3h 50m  (was: 3h 40m)

> Support pandas 1.4.0 in the DataFrame API
> -----------------------------------------
>
>                 Key: BEAM-13605
>                 URL: https://issues.apache.org/jira/browse/BEAM-13605
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Assignee: Andy Ye
>            Priority: P2
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> 1.4.0rc1 is out now, we should verify it works with the DataFrame API, then 
> increase the version range to allow it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
