[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

icexelloss Thu, 19 Oct 2017 11:43:45 -0700

Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/19505
  
    @cloud-fan asked:
    "
    what's the difference between transform and group_transform? Seems we
    don't need to care about it both in usage and implementation.
    "
    
    My answer is:
    transform defines a transformation that doesn't rely on grouping
    semantics. For instance, this is an incorrect udf definition:
    
    @pandas_udf(DoubleType(), TRANSFORM)
    def foo(v):
         return (v - v.mean()) / v.std()
    It is incorrect because the transformation relies on some kind of
    "grouping semantics"; otherwise v.mean() and v.std() have no meaning
    for an arbitrary grouping.
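
To illustrate the point, here is a minimal sketch in plain Python, where lists stand in for pandas Series so it has no dependencies. The names `add_one`, `normalize`, and `apply_per_chunk`, and the two-way split of the data, are hypothetical and not part of Spark or of this proposal. It only shows that an element-wise function gives the same result however the rows are batched, while the normalization does not:

```python
from statistics import mean, stdev

def add_one(values):
    # Element-wise: each output depends only on its own input value.
    return [x + 1.0 for x in values]

def normalize(values):
    # Depends on the whole batch through its mean and sample standard deviation.
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

def apply_per_chunk(func, chunks):
    # Apply func to each chunk independently and concatenate the results.
    out = []
    for chunk in chunks:
        out.extend(func(chunk))
    return out

data = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
chunks = [data[:3], data[3:]]  # hypothetical split into two batches

# The element-wise result is identical regardless of how the data is split.
same = apply_per_chunk(add_one, chunks) == add_one(data)

# The normalized result changes with the split, so it is only well defined
# once the grouping is fixed.
differs = apply_per_chunk(normalize, chunks) != normalize(data)

print(same, differs)
```

In this sketch the first kind of function is what I would call transform, and the second only makes sense as group_transform.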
    
    Also, Catalyst should throw an exception for the code example below:
    ```
    @pandas_udf(DoubleType(), GROUP_TRANSFORM)
    def foo(v):
          return (v - v.mean()) / v.std()
    
    # Should throw an exception here; it should only take a `transform`
    # type udf, not a `group_transform` type udf
    df = df.withColumn('v', foo(df.v))
    ```


