Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@cloud-fan Thanks for your feedback.
I think it makes sense to define `pandas_udaf` as its own function, because
it is a multi-step UDF and is very different from the existing `pandas_udf`.
I also agree we shouldn't add many parameters as flags. However, here are
some things I am not sure about:
* Using different function names (e.g., `pandas_udf` and `pandas_grouped_udf`)
for different input/output types (`pd.Series`, `pd.DataFrame`, or scalar):
there are potentially many combinations of input/output types that
don't fit into `pd.Series -> pd.Series` or `pd.DataFrame -> pd.DataFrame`.
For instance, a vectorized window function or aggregation would
probably be a `pd.Series -> scalar` function, which is different from both
`pandas_udf` and `pandas_grouped_udf`; I am not sure we want to introduce
another decorator for every such case.
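To make the three shapes concrete, here is a plain-pandas sketch (illustrative only, not the Spark API itself) of the function signatures being discussed:

```python
import pandas as pd

def series_to_series(s: pd.Series) -> pd.Series:
    # shape used by the existing scalar `pandas_udf`
    return s * 2

def frame_to_frame(df: pd.DataFrame) -> pd.DataFrame:
    # shape a grouped `groupby().apply()`-style UDF would use
    return df.assign(v=df["v"] - df["v"].mean())

def series_to_scalar(s: pd.Series) -> float:
    # shape a vectorized aggregation or window function would need
    return s.mean()

s = pd.Series([1.0, 2.0, 3.0])
df = pd.DataFrame({"v": [1.0, 2.0, 3.0]})
print(series_to_series(s).tolist())      # [2.0, 4.0, 6.0]
print(frame_to_frame(df)["v"].tolist())  # [-1.0, 0.0, 1.0]
print(series_to_scalar(s))               # 2.0
```

Naming each of these with its own decorator is the combinatorial growth I'm worried about.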
* Distinguishing between a struct column and a DataFrame:
I think we can accomplish this without introducing a new function decorator
or a parameter, but by the context in which `pandas_udf` is used. For instance,
`withColumn` adds a column, so the return type specifies the column type, whereas
`groupby().apply()` maps a DataFrame, so the return type specifies the DataFrame
schema.
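A toy dispatcher, hypothetical and not the actual Spark implementation, shows how one `returnType` annotation could be read differently depending on the call site:

```python
def interpret_return_type(return_type: str, context: str) -> dict:
    """Hypothetical sketch: `withColumn` reads the annotation as a single
    column's type, while `groupby().apply()` reads it as a full schema."""
    if context == "withColumn":
        # the UDF produces one column of this type
        return {"kind": "column", "type": return_type}
    if context == "groupby.apply":
        # the UDF produces a DataFrame with these fields
        fields = [f.strip() for f in return_type.split(",")]
        return {"kind": "schema", "fields": fields}
    raise ValueError(f"unknown context: {context}")

print(interpret_return_type("double", "withColumn"))
print(interpret_return_type("id long, v double", "groupby.apply"))
```

The point is only that the call site already carries enough information, so no extra flag or decorator seems necessary.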