[jira] [Work logged] (BEAM-9547) Implement all pandas operations (or raise WontImplementError)

ASF GitHub Bot (Jira) Mon, 14 Jun 2021 12:39:06 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-9547?focusedWorklogId=610872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-610872
 ]


ASF GitHub Bot logged work on BEAM-9547:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jun/21 19:38
            Start Date: 14/Jun/21 19:38
    Worklog Time Spent: 10m 
      Work Description: TheNeuralBit commented on a change in pull request #14992:
URL: https://github.com/apache/beam/pull/14992#discussion_r651215931



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),
+        requires_partition_by=partitionings.Arbitrary())
+    with expressions.allow_non_parallel_operations(True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'first',
+              lambda df: df.sort_index().first(offset=offset), [per_partition],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Singleton()))

Review comment:
       Yep!

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3037,10 +3059,8 @@ def do_partition_apply(df):
   tail = frame_base.wont_implement_method(
       DataFrameGroupBy, 'tail', explanation=_PEEK_METHOD_EXPLANATION)
 
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason='order-sensitive')
-  last = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'last', reason='order-sensitive')
+  first = frame_base.not_implemented_method('first')

Review comment:
       `not_implemented_method` indicates a method that's not implemented just 
because we haven't gotten to it yet; it will `raise NotImplementedError(..)` if 
used. `wont_implement_method` is for operations that aren't implemented because 
of some structural issue (like being sensitive to order, or producing an output 
schema that we can't determine at construction time). The latter raises an 
error that points to documentation about that type of limitation 
(BEAM-12029 tracks the error messages and BEAM-11951 tracks the documentation, 
which is still in progress). 
   
   "Won't implement" is a little strong, since in practice we may still 
implement some of those in the future. But the barrier for those is higher.
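   
   To make the distinction concrete, here is a minimal standalone sketch. It 
does not use Beam, and it is not how `frame_base` is actually written: 
`ToyWontImplementError`, `make_not_yet_implemented`, `make_wont_implement` and 
`ToyGroupBy` are hypothetical names invented for illustration, and making the 
"won't implement" error a subclass of `NotImplementedError` is an assumption.

```python
class ToyWontImplementError(NotImplementedError):
    """Hypothetical error for operations blocked by a structural limitation."""

    def __init__(self, msg, reason=None):
        if reason is not None:
            # Point the user at the class of limitation, not just the method.
            msg = f"{msg} (see the documentation on '{reason}' operations)"
        super().__init__(msg)
        self.reason = reason


def make_not_yet_implemented(name):
    """Hypothetical factory: the method simply has not been written yet."""
    def method(self, *args, **kwargs):
        raise NotImplementedError(f"'{name}' is not yet supported")
    return method


def make_wont_implement(name, reason):
    """Hypothetical factory: the method is blocked by a structural issue."""
    def method(self, *args, **kwargs):
        raise ToyWontImplementError(f"'{name}' is not supported", reason=reason)
    return method


class ToyGroupBy:
    # Not implemented only because nobody has gotten to it yet.
    first = make_not_yet_implemented("first")
    # Not implemented because the result depends on row order.
    tail = make_wont_implement("tail", reason="order-sensitive")
```

   With the subclass assumption, callers that catch `NotImplementedError` see 
both cases, while callers that care about the structural reason can catch the 
narrower error and inspect `reason`.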

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3210,17 +3230,15 @@ class _DeferredGroupByCols(frame_base.DeferredFrame):
   diff = frame_base._elementwise_method('diff', base=DataFrameGroupBy)
   fillna = frame_base._elementwise_method('fillna', base=DataFrameGroupBy)
   filter = frame_base._elementwise_method('filter', base=DataFrameGroupBy)
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason="order-sensitive")
+  first = frame_base._elementwise_method('first', base=DataFrameGroupBy)

Review comment:
       This is a weird quirk of our implementation. In pandas, when you 
`groupby()` a DataFrame you can change the "axis" you want to group/aggregate 
across. The default is the intuitive `axis="index"`, where each column is 
grouped/aggregated across all of the rows of the dataset.
   
   But users can also specify they want to `groupby(axis="columns")`, in which 
case each _row_ will be grouped/aggregated across the columns. This class, 
`_DeferredGroupByCols`, just handles that `axis="columns"` case.
   
   Technically we can easily support most of these aggregations since they're 
just performing an operation on each element, but it's not clear this path 
actually gets much usage.
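   
   As a rough illustration of the two axes, here is a dependency-free sketch 
using plain dicts instead of pandas. The data and the helper names 
(`sum_grouping_rows`, `sum_grouping_columns`) are made up for this example; 
the point is only that the column-wise case computes each output row from a 
single input row, which is why it can be treated as an elementwise operation.

```python
from collections import defaultdict

rows = [
    {"key": "x", "a1": 1, "a2": 3, "b1": 5},
    {"key": "y", "a1": 2, "a2": 4, "b1": 6},
    {"key": "x", "a1": 10, "a2": 30, "b1": 50},
]


def sum_grouping_rows(rows, by, columns):
    """axis="index" style: rows sharing a key are combined, column by column."""
    totals = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        for column in columns:
            totals[row[by]][column] += row[column]
    return dict(totals)


def sum_grouping_columns(rows, column_groups):
    """axis="columns" style: within one row, columns sharing a label are combined."""
    results = []
    for row in rows:
        totals = defaultdict(int)
        for column, label in column_groups.items():
            totals[label] += row[column]
        results.append(dict(totals))
    return results


# Needs to see every row with the same key before it can emit a result.
by_rows = sum_grouping_rows(rows, by="key", columns=["a1", "a2", "b1"])

# Each output row depends on exactly one input row.
by_columns = sum_grouping_columns(rows, {"a1": "a", "a2": "a", "b1": "b"})
```

   Here `by_rows` has one entry per key, while `by_columns` has one entry per 
input row, so the second form never needs data from another row.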

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),

Review comment:
       This actually means it will preserve any partitioning; 
`preserves=Singleton()` would indicate it preserves no partitioning.
   
   In this case the operation doesn't modify the index at all, so the output 
should still be partitioned in the same way.
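   
   The reasoning can be checked with a small standalone sketch. This is not 
Beam code: partitioning by `index % num_partitions` and the helper names are 
assumptions chosen for illustration. A per-partition operation that only 
drops rows and never rewrites an index leaves every surviving row in the 
partition its index maps to.

```python
from collections import defaultdict


def partition_by_index(rows, num_partitions):
    """Assign each (index, value) row to a partition chosen from its index."""
    partitions = defaultdict(list)
    for index, value in rows:
        partitions[index % num_partitions].append((index, value))
    return dict(partitions)


def keep_small_indexes(partition, cutoff):
    """Sort by index and drop rows at or past the cutoff; indexes are untouched."""
    return [(index, value) for index, value in sorted(partition) if index < cutoff]


rows = [(5, "f"), (0, "a"), (3, "d"), (4, "e"), (1, "b"), (2, "c")]
partitions = partition_by_index(rows, num_partitions=3)

# Apply the operation to each partition independently.
outputs = {
    key: keep_small_indexes(part, cutoff=4) for key, part in partitions.items()
}

# Every surviving row is still in the partition that its index maps to.
still_partitioned = all(
    index % 3 == key for key, part in outputs.items() for index, _ in part
)
```

   An operation that rewrote the index (for example, renumbering rows) would 
break this check, which is the situation where a weaker "preserves" 
declaration would be needed.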






Issue Time Tracking
-------------------

    Worklog Id:     (was: 610872)
    Time Spent: 127h 40m  (was: 127.5h)

> Implement all pandas operations (or raise WontImplementError)
> -------------------------------------------------------------
>
>                 Key: BEAM-9547
>                 URL: https://issues.apache.org/jira/browse/BEAM-9547
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Robert Bradshaw
>            Priority: P2
>              Labels: dataframe-api
>          Time Spent: 127h 40m
>  Remaining Estimate: 0h
>
> We should have an implementation for every DataFrame, Series, and GroupBy 
> method. Everything that's not possible to implement should get a default 
> implementation that raises WontImplementError
> See https://github.com/apache/beam/pull/10757#discussion_r389132292
> Progress at the individual operation level is tracked in a 
> [spreadsheet|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit],
>  consider requesting edit access if you'd like to help out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
