BryanCutler commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-574320076

> On the first one they can wrap an existing function can’t they? They don’t need to modify it.

Yeah, it's just an inconvenience for the user to define a wrapper function and then wrap that again in `pandas_udf`.

> That's my concern too which makes me to decide not to deprecate existing pandas UDFs for now. If we add proposal 1 way too, then we will have 3 ways (old one, proposal 1, and proposal 2) which I think might be more confusing. So my current take is to add the type hint way first, and see if users like it or not in Spark 3 (without any deprecation for now)

Sure, this sounds like a good path forward @HyukjinKwon. My only concern is the case where type hints are required.

> The current udf returns a Column which is able to bind with other SQL expressions whereas

Actually, `pandas_udf` returns a `UserDefinedFunction`. It only returns a `Column` when that function is called with `Column`s as input.

> We do have methods that have similar semantics in both Scala ("strongly" typed Dataset transformations), in SparkR (already mentioned) and 3rd party extensions (sparklyr's spark_apply). None requires end users to be aware of nitty-gritty details of the internal execution model.

@zero323 Maybe I'm thinking of this from the user's perspective, but to me `pandas_udf` in this context is really just a function bound with an output schema. So it seems like we are just talking about

```python
df.groupby("...").applyInPandas(func, schema=df.schema)
```

vs

```python
df.groupby("...").apply(pandas_udf(func, returnType=df.schema))
```

That is just a subtle difference, so I'm not sure how this requires the user to know the details of internal execution.
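To make the "function bound with an output schema" point concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: `ToyGroupedData`, `SchemaBoundFunction`, `bind_schema`, and `subtract_mean` are hypothetical stand-ins, and lists of dicts stand in for pandas DataFrames. It only shows that, under those assumptions, passing the function and schema separately and passing them pre-bound lead to the same result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a pandas.DataFrame: a list of row dicts (assumption).
Rows = List[Dict[str, object]]


@dataclass(frozen=True)
class SchemaBoundFunction:
    """Hypothetical analogue of pandas_udf(func, returnType=...):
    nothing more than a function paired with its output schema."""
    func: Callable[[Rows], Rows]
    schema: Tuple[str, ...]


def bind_schema(func: Callable[[Rows], Rows], schema) -> SchemaBoundFunction:
    # Hypothetical helper playing the role of the pandas_udf(...) call.
    return SchemaBoundFunction(func, tuple(schema))


class ToyGroupedData:
    """Hypothetical analogue of the object returned by df.groupby(...)."""

    def __init__(self, groups: Dict[object, Rows]):
        self._groups = groups

    def apply_in_pandas(self, func: Callable[[Rows], Rows], schema) -> Rows:
        # Function and schema are passed separately.
        out: Rows = []
        for rows in self._groups.values():
            for row in func(rows):
                # Keep only the columns declared in the schema.
                out.append({name: row[name] for name in schema})
        return out

    def apply(self, bound: SchemaBoundFunction) -> Rows:
        # Function and schema arrive pre-bound; unpack and delegate.
        return self.apply_in_pandas(bound.func, bound.schema)


def subtract_mean(rows: Rows) -> Rows:
    # Example per-group function: center column "v" within each group.
    mean = sum(r["v"] for r in rows) / len(rows)
    return [{"id": r["id"], "v": r["v"] - mean} for r in rows]


grouped = ToyGroupedData({
    1: [{"id": 1, "v": 1.0}, {"id": 1, "v": 3.0}],
    2: [{"id": 2, "v": 5.0}],
})
schema = ("id", "v")

direct = grouped.apply_in_pandas(subtract_mean, schema)
wrapped = grouped.apply(bind_schema(subtract_mean, schema))
```

In this toy model the two call styles differ only in where the schema is attached, which is the sense in which the difference is subtle from the user's side.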
