HyukjinKwon commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-573971411

Yes, I think they can wrap it, and it should not be a big problem.

> I'd also worry that forcing type hints might be off-putting to some users, since they are not that widely used or optional.

That's my concern too, which made me decide not to deprecate the existing pandas UDFs for now. If we add the proposal 1 way as well, then we will have three ways (the old one, proposal 1, and proposal 2), which I think might be more confusing. So my current take is to add the type-hint way first and see whether users like it in Spark 3 (without any deprecation for now).

> I'm also not too sure about changing some of the PandasUDFs to use regular functions, like with df.groupby.apply(udf). I think it makes things less consistent by not using a pandas_udf for everything, and it could be inconvenient for the user to keep specifying the schema as an argument for multiple calls, instead of just binding it once with pandas_udf. It should be possible to still remove the udf type from the API so the following could be done, which is almost similar to the proposed change:
>
> ```python
> df.groupby.apply(pandas_udf(f, schema=...))
> ```

I didn't have a strong preference on this change before. It was actually suggested by @zero323. I took several more looks and was convinced by the suggestion. The current `udf` returns a `Column`, which can be combined with other SQL expressions, whereas `pandas_udf` has three inconsistent cases: `df.mapInPandas(udf)`, `df.groupby.apply(udf)`, and `df.groupby.cogroup.apply(udf)`. These APIs cannot accept other expressions either (and it looks impossible to fix them to take other expressions).

- If we can remove the three cases above, `pandas_udf` and `udf` become consistent with other functions as well.
- The names of the APIs `df.groupby.apply` and `df.groupby.cogroup.apply` do not make it clear that they are pandas-UDF specific.
- Although we don't have UDFs in SparkR, there are APIs with a similar shape there, `dapply` and `gapply`, so this isn't completely new in Spark.

  ```r
  dapply(df, function(rdf) { ... }, structType("gear double"))
  ```

  ```r
  gapply(df, "k", function(k, g) { ... }, structType("gear double, disp boolean"))
  ```
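The type-hint proposal hinges on the engine inferring the UDF variant from a function's annotations instead of an explicit UDF-type argument. A minimal sketch of that inference mechanism using only the stdlib `typing` module; the `Series`/`DataFrame` stand-in classes and the variant names are illustrative, not PySpark's actual API:

```python
import typing

class Series: ...      # illustrative stand-in for pandas.Series
class DataFrame: ...   # illustrative stand-in for pandas.DataFrame

def infer_udf_kind(func):
    """Pick a UDF variant from the function's type hints (names are made up)."""
    hints = typing.get_type_hints(func)
    ret = hints.pop("return", None)
    args = list(hints.values())
    if args and all(a is Series for a in args) and ret is Series:
        return "SCALAR"        # Series -> Series: element-wise pandas UDF
    if args and all(a is DataFrame for a in args) and ret is DataFrame:
        return "GROUPED_MAP"   # DataFrame -> DataFrame: group-map style
    return "UNKNOWN"

def plus_one(s: Series) -> Series: ...
def subtract_mean(pdf: DataFrame) -> DataFrame: ...
```

With this, `infer_udf_kind(plus_one)` resolves to `"SCALAR"` and `infer_udf_kind(subtract_mean)` to `"GROUPED_MAP"` without the user ever naming a UDF type explicitly.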
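On the concern about re-specifying the schema on every call: even if `pandas_udf` no longer binds it, a user can bind the schema once with `functools.partial` and reuse the result. A hedged sketch with a hypothetical `apply_with_schema` stand-in (not a real PySpark API) to show the binding pattern:

```python
from functools import partial

def apply_with_schema(func, schema, rows):
    """Hypothetical stand-in for df.groupby.apply(func, schema=schema)."""
    # In real Spark the schema would describe the output DataFrame; here we
    # just return it alongside the transformed rows to show the binding.
    return schema, func(rows)

# Bind the function and schema once, much like pandas_udf(f, schema=...):
subtract_mean = partial(
    apply_with_schema,
    lambda rows: [r - sum(rows) / len(rows) for r in rows],
    "v double",
)

# Reuse across multiple "apply" calls without repeating the schema:
schema, result = subtract_mean([1.0, 2.0, 3.0])
```

Here `subtract_mean([1.0, 2.0, 3.0])` yields the bound schema `"v double"` together with the mean-centered values `[-1.0, 0.0, 1.0]`, so the schema travels with the function just as it would when bound via `pandas_udf`.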
