Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@cloud-fan, it's a good question. I thought quite a bit about it and
discussed it with @viirya:
https://github.com/apache/spark/pull/18732#pullrequestreview-66106082
Just to recap: from an API perspective, I think having just one decorator,
`pandas_udf`, makes it easier for users, since they don't need to think about
which decorator to use where. It does make the implementation a little more
complicated, because some code has to interpret the context in which a
`pandas_udf` is used: in `groupby().apply()` it is a
`pandas.DataFrame -> pandas.DataFrame` function, while in `withColumn` and
`select` it is `pandas.Series -> pandas.Series`.
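To make the two shapes concrete, here is a minimal sketch in plain pandas (no Spark session needed); the function names `plus_one` and `subtract_mean` are illustrative, not from the PR:

```python
import pandas as pd

# Scalar shape: pandas.Series -> pandas.Series,
# the shape a pandas_udf takes inside select() / withColumn().
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Grouped shape: pandas.DataFrame -> pandas.DataFrame,
# the shape the same decorator takes inside groupby().apply().
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df = pd.DataFrame({"g": [1, 1, 2], "v": [1.0, 3.0, 5.0]})

# Series in, Series out.
scalar_result = plus_one(df.v)

# Each group's DataFrame in, a DataFrame out (selecting ["v"] keeps the
# group function independent of the grouping column).
grouped_result = df.groupby("g", group_keys=False)[["v"]].apply(subtract_mean)
```

The point of the sketch is that the same user-defined function concept is dispatched differently purely by where it appears, which is exactly the context-sensitivity discussed above.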
Another thought: even if we were to introduce something like
`pandas_df_udf`, we might still run into issues in the future where, say, we
want an aggregate pandas udf that maps `pandas.Series -> scalar`. So I don't
think we can define a decorator for every input/output shape, because there
can potentially be many.
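For illustration, the hypothetical aggregate shape mentioned above would look like this in plain pandas (`mean_udf` is an invented name, not an API in the PR):

```python
import pandas as pd

# Aggregate shape: pandas.Series -> scalar, i.e. one value per group.
# This is a third input/output shape beyond Series -> Series and
# DataFrame -> DataFrame, supporting the point that the shapes multiply.
def mean_udf(s: pd.Series) -> float:
    return float(s.mean())

df = pd.DataFrame({"g": [1, 1, 2], "v": [1.0, 3.0, 5.0]})

# One scalar per group comes back as a Series indexed by group key.
agg_result = df.groupby("g").v.agg(mean_udf)
```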