Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/19505
Here is a summary of the current proposal from some offline discussion:
I. Use only `pandas_udf`
--------------------------
The main issue with this approach, as a few people have pointed out, is that it
is hard to know what the udf does without looking at the implementation.
For instance, for a udf:
```
@pandas_udf(DoubleType())
def foo(v):
...
```
It's hard to tell whether this function is a reduction that returns a
scalar double, or a transform function that returns a pd.Series of doubles.
This is less than ideal because:
* The user of the udf cannot tell which functions this udf can be used
with, i.e., can this be used with `groupby().apply()`, `withColumn`, or
`groupby().agg()`?
* Catalyst cannot do validation at the planning phase, i.e., it cannot throw
an exception if a user passes a transformation function rather than an
aggregation function to `groupby().agg()`.
II. Use different decorators, i.e., `pandas_udf` (or `pandas_scalar_udf`),
`pandas_grouped_udf`, `pandas_udaf`
----------------------------------------------------------------------------------------------------------------
The idea of this approach is to use `pandas_grouped_udf` for all group
udfs, and `pandas_scalar_udf` for scalar pandas udfs that are used with
`withColumn`. This helps distinguish scalar udfs from group udfs. However,
it doesn't help distinguish among the group udfs themselves, for instance
between a group transform and a group aggregation.
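A plain-Python sketch of the limitation (the decorators below are stand-ins I wrote to show the API shape; `pandas_scalar_udf`/`pandas_grouped_udf` are the proposed names, not an existing PySpark API):

```python
import pandas as pd

# Stand-in decorators that just tag the function with its udf kind,
# mimicking the proposed two-decorator API.
def pandas_scalar_udf(return_type):
    def wrap(f):
        f.udf_kind = "scalar"
        return f
    return wrap

def pandas_grouped_udf(return_type):
    def wrap(f):
        f.udf_kind = "grouped"
        return f
    return wrap

@pandas_scalar_udf("double")
def plus_one(v):
    return v + 1          # scalar udf: Series -> Series, used with withColumn

@pandas_grouped_udf("double")
def demean(v):
    return v - v.mean()   # group transform: Series -> Series

@pandas_grouped_udf("double")
def group_mean(v):
    return v.mean()       # group aggregation: Series -> scalar

# demean and group_mean carry the same "grouped" marker, so the planner
# still cannot tell a transform from an aggregation.
print(demean.udf_kind, group_mean.udf_kind)
```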
III. Use the `pandas_udf` decorator and a function type enum for "one-step"
vectorized udfs, and `pandas_udaf` for multi-step aggregation functions
----------------------------------------------------------
This approach uses a function type enum to describe what the udf does. Here
are the proposed function types:
* transform
A pd.Series(s) -> pd.Series transformation that is independent of the
grouping. This is the existing scalar pandas udf.
* group_transform
A pd.Series(s) -> pd.Series transformation that depends on the
grouping, e.g.
```
@pandas_udf(DoubleType(), GROUP_TRANSFORM)
def foo(v):
return (v - v.mean()) / v.std()
```
* group_aggregate:
A pd.Series(s) -> scalar function, e.g.
```
@pandas_udf(DoubleType(), GROUP_AGGREGATE)
def foo(v):
return v.mean()
```
* group_map (maybe there's a better name):
This defines a pd.DataFrame -> pd.DataFrame transformation. This is the
current `groupby().apply()` udf.
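In plain pandas terms, a group_map udf is the kind of pd.DataFrame -> pd.DataFrame function you would pass to `groupby().apply()`; a small sketch (function and column names are illustrative):

```python
import pandas as pd

def subtract_group_mean(pdf):
    # group_map: receives all rows of one group as a pd.DataFrame and
    # returns a pd.DataFrame (the schema may even differ from the input)
    return pdf.assign(v=pdf.v - pdf.v.mean())

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 3.0, 5.0, 7.0]})
out = df.groupby("id", group_keys=False).apply(subtract_group_mean)
print(out)
```

Unlike the Series-based types above, the function sees the whole group at once, so it can produce output with a different shape or schema per group.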
These types also work with window functions, because window functions are
either (1) group_transforms (e.g., rank) or (2) group_aggregates (e.g., first, last).
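The window-function mapping can be checked in plain pandas; rank produces one output row per input row (a group_transform), while first produces one value per group (a group_aggregate):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [3.0, 1.0, 7.0, 5.0]})
g = df.groupby("id")["v"]

# group_transform: same length as the input frame
ranks = g.rank()

# group_aggregate: one scalar per group
firsts = g.first()

print(len(ranks), len(firsts))
```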
I am in favor of (III). What do you guys think?