HyukjinKwon commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type 
hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-573971411
 
 
   Yes, I think they can wrap it, and it should not be a big problem.
   
   > I'd also worry that forcing type hints might be off-putting to some users, 
since they are not that widely used or optional.
   
   That's my concern too, and it's why I decided not to deprecate the existing 
pandas UDFs for now. If we add the proposal 1 way as well, then we will have three 
ways (the old one, proposal 1, and proposal 2), which I think might be more 
confusing. So my current take is to add the type hint way first, and see whether 
users like it in Spark 3 (without any deprecation for now).
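   To make the type hint way concrete, here is a minimal sketch (names here are 
illustrative, not the final Spark API): the idea is that a pandas UDF's variant 
can be inferred from its annotations instead of an explicit `functionType` 
argument.

   ```python
   import pandas as pd
   from typing import get_type_hints

   # Hypothetical scalar pandas UDF: the Series -> Series annotations
   # would let Spark infer SCALAR behavior, with no functionType argument.
   def plus_one(s: pd.Series) -> pd.Series:
       return s + 1

   # Spark could inspect the hints to decide how to execute the function:
   hints = get_type_hints(plus_one)
   ```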
   
   > I'm also not too sure about changing some of the PandasUDFs to use regular 
functions, like with df.groupby.apply(udf). I think it makes things less 
consistent by not using a pandas_udf for everything, and it could be 
inconvenient for the user to keep specifying the schema as an argument for 
multiple calls, instead of just binding it once with pandas_udf. It should be 
possible to still remove the udf type from the API so the following could be 
done which is almost similar to the proposed change:
   >
   > ```python
   > df.groupby.apply(pandas_udf(f, schema=...))
   > ```
   
   I didn't have a strong preference on this change before; it was actually 
suggested by @zero323.
   After taking a few more looks, I became convinced by the suggestion.
   
   The current `udf` returns a `Column`, which can bind with other SQL 
expressions, whereas `pandas_udf` has three inconsistent cases: 
`df.mapInPandas(udf)`, `df.groupby.apply(udf)`, and 
`df.groupby.cogroup.apply(udf)`. These APIs cannot accept other expressions 
either (and it looks impossible to fix them to take other expressions).
   
   - If we remove the three cases above, `pandas_udf` and `udf` become 
consistent with the other functions as well.
   
   - The naming of the APIs `df.groupby.apply` and `df.groupby.cogroup.apply` 
does not make it clear that they are pandas-UDF specific.
   
   - Although we don't have UDFs in SparkR, there are similarly shaped APIs in 
SparkR, `dapply` and `gapply`. So this isn't completely new in Spark.
   
       ```r
       dapply(df, function(rdf) { ... }, structType("gear double"))
       ```
   
       ```r
       gapply(df, "k", function(k, g) { ... }, structType("gear double, disp boolean"))
       ```
