[GitHub] [beam] TheNeuralBit commented on a change in pull request #14992: [BEAM-9547] Add implementation for first and last

GitBox Mon, 14 Jun 2021 12:38:27 -0700


TheNeuralBit commented on a change in pull request #14992:
URL: https://github.com/apache/beam/pull/14992#discussion_r651215931




##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),
+        requires_partition_by=partitionings.Arbitrary())
+    with expressions.allow_non_parallel_operations(True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'first',
+              lambda df: df.sort_index().first(offset=offset), [per_partition],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Singleton()))

Review comment:
       Yep!

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3037,10 +3059,8 @@ def do_partition_apply(df):
   tail = frame_base.wont_implement_method(
       DataFrameGroupBy, 'tail', explanation=_PEEK_METHOD_EXPLANATION)
 
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason='order-sensitive')
-  last = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'last', reason='order-sensitive')
+  first = frame_base.not_implemented_method('first')

Review comment:
       `not_implemented_method` indicates a method that's not implemented just 
because we haven't gotten to it yet; it will `raise NotImplementedError(..)` if 
used. `wont_implement_method` is for operations that aren't implemented because 
of some structural issue (like being sensitive to order, or producing an output 
schema that we can't determine at construction time). The latter raises an 
error that points to documentation about that type of limitation 
(BEAM-12029 covers the error messages and BEAM-11951 covers the documentation, 
which is still in progress).
   
   "Won't implement" is a little strong, since in practice we may still 
implement some of those in the future. But the barrier for those is higher.
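
   A rough sketch of the distinction, using plain-Python stand-ins rather than 
the actual `frame_base` helpers. The class `ToyGroupBy`, the `reason` attribute, 
the message wording, and modeling `WontImplementError` as a subclass of 
`NotImplementedError` are all assumptions made for illustration:

```python
class WontImplementError(NotImplementedError):
    """Stand-in error for operations blocked by a structural limitation."""

    def __init__(self, msg, reason=None):
        if reason is not None:
            # Hypothetical wording: point the user at docs for this limitation.
            msg = f"{msg} (reason: {reason}; see the docs on this limitation)"
        super().__init__(msg)
        self.reason = reason


def not_implemented_method(name):
    """Placeholder for a method nobody has gotten to yet."""
    def wrapper(self, *args, **kwargs):
        raise NotImplementedError(f"{name!r} is not implemented yet")
    return wrapper


def wont_implement_method(name, reason):
    """Placeholder for a method blocked by a structural issue."""
    def wrapper(self, *args, **kwargs):
        raise WontImplementError(f"{name!r} is not supported", reason=reason)
    return wrapper


class ToyGroupBy:
    # Not written yet: plain NotImplementedError.
    first = not_implemented_method('first')
    # Structurally problematic: error carries the category of limitation.
    tail = wont_implement_method('tail', reason='order-sensitive')
```

   In this sketch, calling `ToyGroupBy().first()` raises a plain 
`NotImplementedError`, while `ToyGroupBy().tail()` raises the more specific 
error whose `reason` identifies the kind of limitation.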

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3210,17 +3230,15 @@ class _DeferredGroupByCols(frame_base.DeferredFrame):
   diff = frame_base._elementwise_method('diff', base=DataFrameGroupBy)
   fillna = frame_base._elementwise_method('fillna', base=DataFrameGroupBy)
   filter = frame_base._elementwise_method('filter', base=DataFrameGroupBy)
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason="order-sensitive")
+  first = frame_base._elementwise_method('first', base=DataFrameGroupBy)

Review comment:
       This is a weird quirk of our implementation. In pandas when you 
groupby() a DataFrame you can change the "axis" you want to group/aggregate 
across. The default is the intuitive axis="index", where each column is 
grouped/aggregated across all of the rows of the dataset.
   
   But users can also specify they want to groupby(axis="columns"), in which 
case each _row_ will be grouped/aggregated across the columns. This class, 
`_DeferredGroupByCols`, is just handling that `axis="columns"` case.
   
   Technically we can easily support most of these aggregations since they're 
just performing an operation on each element, but it's not clear this path 
actually gets much usage.
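
   To make the two axes concrete, here is a small plain-Python model (not 
pandas and not Beam; the table layout and the helper names 
`sum_by_index_axis` and `sum_by_columns_axis` are made up for this sketch). It 
shows why the `axis="columns"` case can be treated as a per-row operation: each 
output row depends only on the matching input row.

```python
from collections import defaultdict

# Toy table: each row is a dict of column name -> value.
rows = [
    {"key": "a", "x": 1, "y": 10},
    {"key": "b", "x": 2, "y": 20},
    {"key": "a", "x": 3, "y": 30},
]


def sum_by_index_axis(rows, key_col):
    """axis="index" style: rows sharing a key are combined, column by column."""
    out = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for col, value in row.items():
            if col != key_col:
                out[row[key_col]][col] += value
    return {key: dict(cols) for key, cols in out.items()}


def sum_by_columns_axis(rows, col_groups):
    """axis="columns" style: within each row, columns sharing a label are combined."""
    result = []
    for row in rows:
        out = defaultdict(int)
        for col, label in col_groups.items():
            out[label] += row[col]
        result.append(dict(out))
    return result


# Needs to see every row with the same key, so rows must be brought together.
by_rows = sum_by_index_axis(rows, "key")
# Each output row is computed from one input row only.
by_cols = sum_by_columns_axis(rows, {"x": "total", "y": "total"})
```

   Here `by_rows` is `{"a": {"x": 4, "y": 40}, "b": {"x": 2, "y": 20}}`, while 
`by_cols` is `[{"total": 11}, {"total": 22}, {"total": 33}]`, one result per 
input row.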

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),

Review comment:
       `Arbitrary()` here actually means the operation will preserve any 
partitioning; `preserves_partition_by=partitionings.Singleton()` would indicate 
it preserves no partitioning.
   
   In this case the operation doesn't modify the index at all, so the output 
should still be partitioned in the same way.
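
   One way to picture this is to read `preserves_partition_by` as an upper 
bound on which input partitionings survive the operation. The following is a 
toy model only: the string constants, the rank table, and the 
`output_partitioning` helper are assumptions for illustration and do not 
reproduce Beam's partitioning classes or planner logic.

```python
# Stand-ins for partitioning levels; not Beam's partitioning classes.
SINGLETON, INDEX, ARBITRARY = "Singleton", "Index", "Arbitrary"

# Assumed ordering for this sketch: a higher rank as a `preserves` bound
# means more input partitionings are carried through to the output.
_RANK = {SINGLETON: 0, INDEX: 1, ARBITRARY: 2}


def output_partitioning(input_partitioning, preserves):
    """Return what can still be assumed about the output's partitioning.

    The input partitioning survives only if it does not exceed the
    `preserves` bound. Otherwise nothing useful is known about the output,
    which this toy model reports as None.
    """
    if _RANK[input_partitioning] <= _RANK[preserves]:
        return input_partitioning
    return None


# An index-partitioned input stays index-partitioned under Arbitrary...
kept = output_partitioning(INDEX, preserves=ARBITRARY)
# ...but that guarantee is lost when the bound is Singleton.
lost = output_partitioning(INDEX, preserves=SINGLETON)
```

   In this model `kept` is `"Index"` and `lost` is `None`, matching the reading 
above: `Arbitrary()` keeps whatever partitioning the input had, and 
`Singleton()` keeps nothing beyond the trivial single-partition case.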




