[ 
https://issues.apache.org/jira/browse/BEAM-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435636#comment-17435636
 ] 

Brian Hulette commented on BEAM-12560:
--------------------------------------

Hi [~Mike Hernandez], sorry it took a while for me to get back to you on this.

The "proxy" is an empty DataFrame that we associate with every expression in 
the DataFrame expression tree. The idea is that it will have the same "shape" 
(that is, the same columns and indexes, with the same names and types) as the 
dataframe that this expression will produce when it's executed, except it's 
completely empty. The proxy is really useful because it lets us know what 
operations are valid, so we can catch errors before the job starts running. For 
example, if we know that df has columns foo and bar, then df.foo is valid but 
df.dog is not, and if the user tries the latter we raise an error early.
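
To make the idea concrete, here is a rough sketch in plain pandas. It is not 
Beam's actual code: the {{check_column}} helper and the dtypes are made up for 
illustration, and only the column names foo and bar come from the example above.

```python
import pandas as pd

# A hand-built stand-in for a proxy: same columns and dtypes as the real
# data would have, but zero rows. The dtypes here are assumptions.
proxy = pd.DataFrame({
    "foo": pd.Series(dtype="int64"),
    "bar": pd.Series(dtype="float64"),
})


def check_column(proxy, name):
    """Hypothetical helper: fail early if `name` is not a column of the proxy."""
    if name not in proxy.columns:
        raise AttributeError(f"no column named {name!r}")
    return proxy[name]


foo_proxy = check_column(proxy, "foo")  # valid: an empty int64 Series

try:
    check_column(proxy, "dog")  # invalid: rejected without touching any data
    dog_rejected = False
except AttributeError:
    dog_rejected = True
```

The check only ever looks at the empty proxy, which is why it can run at 
pipeline construction time.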

Now, usually, we can generate these proxy objects automatically, which is what's 
happening on the line you accessed. We just execute the expression's function, 
and pass in the proxy object(s) for the input(s). Unfortunately there are some 
cases where this doesn't work because pandas won't let you do certain 
operations on an empty dataset. In these cases we have to construct the proxy 
manually and pass it to the expression with the {{proxy=}} argument. There are 
a few examples of this already, like 
[DataFrame.rename|https://github.com/apache/beam/blob/cd4b7f3b1af4f51bdab1a0b1a98f94b5288c09ec/sdks/python/apache_beam/dataframe/frames.py#L3237].
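
A simplified illustration of both situations, again in plain pandas rather than 
Beam's real expression classes ({{infer_proxy}} is a hypothetical name). The 
failure shown for {{idxmin}} is what I'd expect from the pandas versions I know 
of, so treat it as an assumption rather than a guarantee.

```python
import pandas as pd


def infer_proxy(func, *input_proxies):
    """Hypothetical helper: run the expression's function on the empty
    input proxies; the (empty) result then serves as the output proxy."""
    return func(*input_proxies)


input_proxy = pd.DataFrame({
    "foo": pd.Series(dtype="float64"),
    "bar": pd.Series(dtype="float64"),
})

# This works on empty input: the result has the right column and dtype
# and no rows, so it can be used directly as the proxy.
auto_proxy = infer_proxy(lambda df: df[["bar"]] * 2, input_proxy)

# Some operations refuse to run on empty data, so inference is not possible
# and the proxy has to be built by hand instead.
try:
    infer_proxy(lambda df: df.idxmin(), input_proxy)
    inference_failed = False
except ValueError:
    inference_failed = True
```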

So for the idxmin/idxmax case, it looks like you'll need to generate a proxy 
(an empty Series) with an index that matches the type of the input's columns, 
and values with the same dtype as the input's index.
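
Something along these lines, as a sketch only: {{idx_proxy}} is a made-up 
helper, the sample data is invented, and the real change in frames.py would 
still need to pass the result through the {{proxy=}} argument.

```python
import pandas as pd


def idx_proxy(input_proxy):
    """Hypothetical helper: an empty Series shaped like the result of
    DataFrame.idxmin() / idxmax() on data matching `input_proxy`."""
    return pd.Series(
        [],
        index=input_proxy.columns[:0],  # empty, but same Index type as the columns
        dtype=input_proxy.index.dtype,  # values are labels taken from the index
    )


# Invented sample data, used only to compare the proxy with a real result.
df = pd.DataFrame({"foo": [1.0, 2.0], "bar": [3.0, 0.5]}, index=[10, 20])
input_proxy = df.iloc[:0]  # same columns, dtypes, and index type, zero rows

proxy = idx_proxy(input_proxy)
real = df.idxmin()  # the non-empty result the proxy is meant to describe
```

Comparing {{proxy.dtype}} and {{proxy.index.dtype}} against {{real}} is a quick 
way to sanity-check that the proxy has the right shape.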

> Implement idxmin and idxmax for DataFrame, Series, and GroupBy
> --------------------------------------------------------------
>
>                 Key: BEAM-12560
>                 URL: https://issues.apache.org/jira/browse/BEAM-12560
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Assignee: Rogelio Miguel Hernandez Sandoval
>            Priority: P3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add an implementation of 
> [idxmin|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html]
>  and 
> [idxmax|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html]
>  for DeferredDataFrame, DeferredSeries, and DeferredGroupBy. It should be 
> fully unit tested with some combination of pandas_doctests_test.py and 
> frames_test.py.
> https://github.com/apache/beam/pull/14274 is an example of a typical PR that 
> adds new operations. See 
> https://lists.apache.org/thread.html/r8ffe96d756054610dfdb4e849ffc6a741e826d15ba7e5bdeee1b4266%40%3Cdev.beam.apache.org%3E
>  for background on the DataFrame API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
