TheNeuralBit commented on a change in pull request #14992:
URL: https://github.com/apache/beam/pull/14992#discussion_r651215931
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),
+        requires_partition_by=partitionings.Arbitrary())
+    with expressions.allow_non_parallel_operations(True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'first',
+              lambda df: df.sort_index().first(offset=offset), [per_partition],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Singleton()))
Review comment:
Yep!
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3037,10 +3059,8 @@ def do_partition_apply(df):
   tail = frame_base.wont_implement_method(
       DataFrameGroupBy, 'tail', explanation=_PEEK_METHOD_EXPLANATION)
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason='order-sensitive')
-  last = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'last', reason='order-sensitive')
+  first = frame_base.not_implemented_method('first')
Review comment:
`not_implemented_method` indicates a method that's not implemented just
because we haven't gotten to it yet; it will `raise NotImplementedError(..)` if
used. `wont_implement_method` is for operations that aren't implemented because
of some structural issue (like being sensitive to order, or producing an output
schema that we can't determine at construction time). The latter raises an
error that points to documentation about that type of limitation
(BEAM-12029 tracks the error messages and BEAM-11951 tracks the documentation,
which is still in progress).
"Won't implement" is a little strong, since in practice we may still
implement some of those in the future, but the barrier for those is higher.
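To make the distinction concrete, here is a toy sketch in plain Python. The helper names mirror the ones in `frame_base`, but the function bodies, the `WontImplementError` class as written here, the message text, and `ToyGroupBy` are simplified stand-ins made up for illustration. They are not the actual Beam implementation.

```python
class WontImplementError(NotImplementedError):
  """Stand-in error for structural limitations (assumed shape)."""


def not_implemented_method(name):
  # Hypothetical stand-in: the operation just hasn't been written yet.
  def wrapper(*args, **kwargs):
    raise NotImplementedError(f"'{name}' is not yet implemented.")
  return wrapper


def wont_implement_method(name, reason):
  # Hypothetical stand-in: the operation conflicts with the deferred model,
  # so the message names the category of limitation.
  def wrapper(*args, **kwargs):
    raise WontImplementError(
        f"'{name}' is not supported because it is {reason}.")
  return wrapper


class ToyGroupBy:
  # Illustrative class only; it does not correspond to a real Beam class.
  first = not_implemented_method('first')
  tail = wont_implement_method('tail', reason='order-sensitive')
```

In this sketch both calls fail, but a caller can tell "not done yet" apart from "structurally hard" by the exception type and by the reason carried in the message.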
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3210,17 +3230,15 @@ class _DeferredGroupByCols(frame_base.DeferredFrame):
   diff = frame_base._elementwise_method('diff', base=DataFrameGroupBy)
   fillna = frame_base._elementwise_method('fillna', base=DataFrameGroupBy)
   filter = frame_base._elementwise_method('filter', base=DataFrameGroupBy)
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason="order-sensitive")
+  first = frame_base._elementwise_method('first', base=DataFrameGroupBy)
Review comment:
This is a weird quirk of our implementation. In pandas when you
groupby() a DataFrame you can change the "axis" you want to group/aggregate
across. The default is the intuitive axis="index", where each column is
grouped/aggregated across all of the rows of the dataset.
But users can also specify groupby(axis="columns"), in which
case each _row_ will be grouped/aggregated across the columns. This class,
`_DeferredGroupByCols`, is just handling that `axis="columns"` case.
Technically we can easily support most of these aggregations since they're
just performing an operation on each element, but it's not clear this path
actually gets much usage.
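A small plain-Python model of the two grouping directions may help. This deliberately avoids pandas and Beam: the table layout, `value_columns`, and both function names are assumptions made up for this sketch, and summing stands in for whatever aggregation is applied.

```python
from collections import defaultdict

# Toy table: each row is a dict of column name -> value.
rows = [
    {'key': 'x', 'a1': 1, 'a2': 3, 'b1': 5},
    {'key': 'y', 'a1': 2, 'a2': 4, 'b1': 6},
    {'key': 'x', 'a1': 10, 'a2': 30, 'b1': 50},
]
value_columns = ['a1', 'a2', 'b1']


def sum_grouped_by_rows(rows, key):
  # Like axis="index": rows sharing a key are combined, column by column.
  # The result for one key can depend on rows anywhere in the dataset.
  out = defaultdict(lambda: defaultdict(int))
  for row in rows:
    for col in value_columns:
      out[row[key]][col] += row[col]
  return {k: dict(v) for k, v in out.items()}


def sum_grouped_by_columns(rows, column_label):
  # Like axis="columns": within each row, columns sharing a label are
  # combined. Each output row depends only on the matching input row.
  result = []
  for row in rows:
    out = defaultdict(int)
    for col in value_columns:
      out[column_label(col)] += row[col]
    result.append(dict(out))
  return result


by_rows = sum_grouped_by_rows(rows, 'key')
by_cols = sum_grouped_by_columns(rows, lambda col: col[0])
```

Because the column-wise version never looks at more than one row at a time, it behaves like a per-row operation, which is the property that makes this path easy to support.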
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),
Review comment:
This actually means it will preserve any partitioning;
`preserves_partition_by=partitionings.Singleton()` would indicate that it
preserves no partitioning.
In this case the operation doesn't modify the index at all, so the output
should still be partitioned in the same way as the input.
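The reasoning can be checked with a toy model in plain Python. This is not the `partitionings` API: rows are modeled as `(index, value)` pairs, partition assignment is assumed to be `hash(index) % n`, and all four helper names are hypothetical.

```python
def partition_by_index(rows, num_partitions):
  # A row's partition depends only on its index value.
  partitions = [[] for _ in range(num_partitions)]
  for index, value in rows:
    partitions[hash(index) % num_partitions].append((index, value))
  return partitions


def is_partitioned_by_index(partitions):
  # True if every row sits in the partition its index hashes to.
  n = len(partitions)
  return all(
      hash(index) % n == i
      for i, part in enumerate(partitions)
      for index, _ in part)


def keep_leading(part, cutoff):
  # Index-preserving: sorts and drops rows, never rewrites an index value.
  return [(index, value) for index, value in sorted(part) if index < cutoff]


def reset_index(part):
  # Index-modifying: renumbers rows, so the old partitioning no longer holds.
  return [(i, value) for i, (_, value) in enumerate(sorted(part))]


rows = [(i, i * i) for i in range(20)]
parts = partition_by_index(rows, 4)
filtered = [keep_leading(p, 10) for p in parts]
renumbered = [reset_index(p) for p in parts]
```

Here `filtered` is still partitioned by index while `renumbered` is not, which is the difference between an operation that can claim to preserve any input partitioning and one that cannot.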