Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/19505
@cloud-fan asked:
"
what's the difference between transform and group_transform? Seems we don't
need to care about it both in usage and implementation.
"
My answer is:
transform defines a transformation that doesn't rely on grouping
semantics. For instance, this is a wrong udf definition:
    @pandas_udf(DoubleType(), TRANSFORM)
    def foo(v):
        return (v - v.mean()) / v.std()
because the transformation relies on some kind of "grouping semantics";
otherwise v.mean() and v.std() have no meaning for an arbitrary grouping.
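To make the distinction concrete, here is a minimal pandas-only sketch (no Spark involved). The names `plus_one`, `normalize`, and `apply_in_batches` are hypothetical and are not part of the proposed API; `apply_in_batches` is just my assumption of how an engine might feed a column to a udf in arbitrary chunks.

```python
import pandas as pd

def plus_one(v):
    # Element-wise: each output depends only on its own input value.
    return v + 1

def normalize(v):
    # Depends on the whole Series it receives (its mean and std).
    return (v - v.mean()) / v.std()

def apply_in_batches(func, v, batch_size):
    # Hypothetical stand-in for an engine that passes a column to a udf
    # in arbitrary chunks and concatenates the results.
    chunks = [func(v.iloc[i:i + batch_size]) for i in range(0, len(v), batch_size)]
    return pd.concat(chunks)

v = pd.Series([1.0, 2.0, 3.0, 4.0])

# plus_one gives the same answer no matter how the rows are batched.
plus_one_stable = apply_in_batches(plus_one, v, 2).equals(apply_in_batches(plus_one, v, 4))

# normalize does not: its result changes with the batch boundaries.
normalize_stable = apply_in_batches(normalize, v, 2).equals(apply_in_batches(normalize, v, 4))

print(plus_one_stable, normalize_stable)
```

A function like `plus_one` is safe as a plain transform, while `normalize` only has a well-defined result once the grouping is fixed, which is why it belongs under group_transform.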
Also, Catalyst should throw an exception for the code example below:
```
@pandas_udf(DoubleType(), GROUP_TRANSFORM)
def foo(v):
    return (v - v.mean()) / v.std()

# Should throw an exception here: it should only take a `transform`
# udf, not a `group_transform` one
df = df.withColumn('v', foo(df.v))
```