TheNeuralBit commented on a change in pull request #14517:
URL: https://github.com/apache/beam/pull/14517#discussion_r613688586
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -1393,9 +1430,16 @@ def fill_dataframe(*args):
- cummax = cummin = cumsum = cumprod = frame_base.wont_implement_method(
- 'order-sensitive')
- diff = frame_base.wont_implement_method('order-sensitive')
+ cummax = frame_base.order_sensitive_method(pd.DataFrame, 'cummax')
+ cummin = frame_base.order_sensitive_method(pd.DataFrame, 'cummin')
+ cumprod = frame_base.order_sensitive_method(pd.DataFrame, 'cumprod')
+ cumsum = frame_base.order_sensitive_method(pd.DataFrame, 'cumsum')
+ diff = frame_base.order_sensitive_method(pd.DataFrame, 'diff')
Review comment:
Ahh I see. Yeah I think you're right we could do that and it wouldn't be
too expensive. The problem there is it would be a divergence from pandas, which
always ignores the index and computes the diff based on the order of the data.
That feels like a nitpicky argument, but it could be surprising for someone
very familiar with pandas, so we've opted to just disallow cases like this.
There are some cases where we've added Beam-only arguments that make
operations feasible in Beam (e.g. `nlargest(keep='any')`,
`unique(as_series=True)`). We could consider that here, something like
`diff(by_index=True)`.
I filed BEAM-12171 to track this idea, thanks :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]