Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/19505
Here is a summary of the current proposal from some offline discussion:
I. Use only `pandas_udf`
--------------------------
The main issue with this approach, as a few people have pointed out, is that it
is hard to know what the udf does without looking at the implementation.
For instance, for a udf:
```
@pandas_udf(DoubleType())
def foo(v):
...
```
It's hard to tell whether this function is a reduction that returns a
scalar double, or a transform function that returns a pd.Series of doubles.
This is less than ideal because:
* The user of the udf cannot tell which functions this udf can be used
with, i.e., can this be used with `groupby().apply()`, `withColumn`, or
`groupby().agg()`?
* Catalyst cannot do validation at the planning phase, i.e., it cannot throw
an exception if a user passes a transformation function rather than an
aggregation function to `groupby().agg()`.
II. Use different decorators, i.e., `pandas_udf` (or `pandas_scalar_udf`),
`pandas_grouped_udf`, `pandas_udaf`
----------------------------------------------------------------------------------------------------------------
The idea of this approach is to use `pandas_grouped_udf` for all group
udfs, and `pandas_scalar_udf` for scalar pandas udfs that are used with
`withColumn`. This helps distinguish scalar udfs from group udfs. However,
it doesn't help distinguish among the group udfs themselves, for instance
between a group transform and a group aggregation.
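A plain-Python sketch of the limitation (the decorators below are stand-ins I wrote to show the API shape; `pandas_scalar_udf`/`pandas_grouped_udf` are the proposed names, not an existing PySpark API):

```python
import pandas as pd

# Stand-in decorators that just tag the function with its udf kind,
# mimicking the proposed two-decorator API.
def pandas_scalar_udf(return_type):
    def wrap(f):
        f.udf_kind = "scalar"
        return f
    return wrap

def pandas_grouped_udf(return_type):
    def wrap(f):
        f.udf_kind = "grouped"
        return f
    return wrap

@pandas_scalar_udf("double")
def plus_one(v):
    return v + 1          # scalar udf: Series -> Series, used with withColumn

@pandas_grouped_udf("double")
def demean(v):
    return v - v.mean()   # group transform: Series -> Series

@pandas_grouped_udf("double")
def group_mean(v):
    return v.mean()       # group aggregation: Series -> scalar

# demean and group_mean carry the same "grouped" marker, so the planner
# still cannot tell a transform from an aggregation.
print(demean.udf_kind, group_mean.udf_kind)
```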
III. Use the `pandas_udf` decorator and a function type enum for "one-step"
vectorized udfs, and `pandas_udaf` for multi-step aggregation functions
----------------------------------------------------------
This approach uses a function type enum to describe what the udf does. Here
are the proposed function types:
* transform
A pd.Series(s) -> pd.Series transformation that is independent of the
grouping. This is the existing scalar pandas udf.
* group_transform
A pd.Series(s) -> pd.Series transformation that depends on the
grouping, e.g.
```
@pandas_udf(DoubleType(), GROUP_TRANSFORM)
def foo(v):
return (v - v.mean()) / v.std()
```
* group_aggregate:
A pd.Series(s) -> scalar function, e.g.
```
@pandas_udf(DoubleType(), GROUP_AGGREGATE)
def foo(v):
return v.mean()
```
* group_map (maybe there's a better name):
This defines a pd.DataFrame -> pd.DataFrame transformation. This is the
current `groupby().apply()` udf.
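In plain pandas terms, a group_map udf is the kind of pd.DataFrame -> pd.DataFrame function you would pass to `groupby().apply()`; a small sketch (function and column names are illustrative):

```python
import pandas as pd

def subtract_group_mean(pdf):
    # group_map: receives all rows of one group as a pd.DataFrame and
    # returns a pd.DataFrame (the schema may even differ from the input)
    return pdf.assign(v=pdf.v - pdf.v.mean())

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 3.0, 5.0, 7.0]})
out = df.groupby("id", group_keys=False).apply(subtract_group_mean)
print(out)
```

Unlike the Series-based types above, the function sees the whole group at once, so it can produce output with a different shape or schema per group.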
These types also work with window functions, because window functions are
either (1) group_transforms (e.g., rank) or (2) group_aggregates (e.g., first, last).
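The window-function mapping can be checked in plain pandas; rank produces one output row per input row (a group_transform), while first produces one value per group (a group_aggregate):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [3.0, 1.0, 7.0, 5.0]})
g = df.groupby("id")["v"]

# group_transform: same length as the input frame
ranks = g.rank()

# group_aggregate: one scalar per group
firsts = g.first()

print(len(ranks), len(firsts))
```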
I am in favor of (III). What do you guys think?