Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@cloud-fan Thanks for your feedback.
I think it makes sense to define `pandas_udaf` as its own function, because
it is a multi-step UDF and is very different from the existing `pandas_udf`.
I also agree we shouldn't add many parameters as flags. However, here are
some things I am not sure about:
* Using different function names (e.g., `pandas_udf` and `pandas_grouped_udf`)
for different input/output types (`pd.Series`, `pd.DataFrame`, or scalar):
there are potentially many combinations of input/output types that
don't fit into `pd.Series -> pd.Series` or `pd.DataFrame -> pd.DataFrame`.
For instance, a vectorized window function or aggregation would
probably be a `pd.Series -> scalar` function, which is different from both
`pandas_udf` and `pandas_grouped_udf`; I am not sure we want to introduce
another decorator for every such case.
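To make the three shapes concrete, here is a plain-pandas sketch (illustrative only, not the Spark API itself) of the function signatures being discussed:

```python
import pandas as pd

def series_to_series(s: pd.Series) -> pd.Series:
    # shape used by the existing scalar `pandas_udf`
    return s * 2

def frame_to_frame(df: pd.DataFrame) -> pd.DataFrame:
    # shape a grouped `groupby().apply()`-style UDF would use
    return df.assign(v=df["v"] - df["v"].mean())

def series_to_scalar(s: pd.Series) -> float:
    # shape a vectorized aggregation or window function would need
    return s.mean()

s = pd.Series([1.0, 2.0, 3.0])
df = pd.DataFrame({"v": [1.0, 2.0, 3.0]})
print(series_to_series(s).tolist())      # [2.0, 4.0, 6.0]
print(frame_to_frame(df)["v"].tolist())  # [-1.0, 0.0, 1.0]
print(series_to_scalar(s))               # 2.0
```

Naming each of these with its own decorator is the combinatorial growth I'm worried about.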
* Distinguishing between a struct column and a DataFrame:
I think we can accomplish this without introducing a new function decorator
or a parameter, but by the context in which `pandas_udf` is used. For instance,
`withColumn` adds a column, so the return type specifies the column type, whereas
`groupby().apply()` maps a DataFrame, so the return type specifies the DataFrame
schema.
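A toy dispatcher, hypothetical and not the actual Spark implementation, shows how one `returnType` annotation could be read differently depending on the call site:

```python
def interpret_return_type(return_type: str, context: str) -> dict:
    """Hypothetical sketch: `withColumn` reads the annotation as a single
    column's type, while `groupby().apply()` reads it as a full schema."""
    if context == "withColumn":
        # the UDF produces one column of this type
        return {"kind": "column", "type": return_type}
    if context == "groupby.apply":
        # the UDF produces a DataFrame with these fields
        fields = [f.strip() for f in return_type.split(",")]
        return {"kind": "schema", "fields": fields}
    raise ValueError(f"unknown context: {context}")

print(interpret_return_type("double", "withColumn"))
print(interpret_return_type("id long, v double", "groupby.apply"))
```

The point is only that the call site already carries enough information, so no extra flag or decorator seems necessary.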