zero323 commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-574335425

> @zero323 Maybe I'm thinking of this from the user's perspective, but to me `pandas_udf` in this context is really just a function bound with an output schema. So it seems like we are just talking about
>
> ```python
> df.groupby("...").applyInPandas(func, schema=df.schema)
> ```
>
> vs
>
> ```python
> df.groupby("...").apply(pandas_udf(func, returnType=df.schema))
> ```
>
> Or some variant of that, where the schema is derived from `df.schema`.
>
> Which is just a subtle difference, so I'm not sure how this requires the user to know the details of internal execution.

From where I stand things look like this:

- As long as hints are not obligatory (and, as mentioned in the design doc, types alone might not be sufficient to encode cardinality), the latter will also require a `functionType`. This also affects which function types have to be exposed in the public API.
- It also tells the user that we internally use a UDF for execution, and it requires creating an object that is not suitable for use as a UDF if we stick to the public API. Without looking at the function type (which alone is not friendly for interactive introspection) it is not exactly obvious that an object marked as a UDF, and implementing the UDF interface:

  ```python
  from typing import Protocol

  from pyspark.sql import Column


  class UDFLike(Protocol):
      def __call__(self, *__args: Column) -> Column: ...
  ```

  is not intended to be used as a UDF at all. This looks rather confusing to me, though I admit I am used to looking at the PySpark API in terms of types.
- Last but not least, the `pandas_udf` documentation is already hard to digest (~400 lines and growing).

The first and the last point affect anyone who uses `pandas_udf` (and the first one seems to drive the discussion about API changes, so taking away some of that external complexity is good), even if the user doesn't utilize these particular UDF variants. And I honestly cannot see any good reason for that approach ‒ PySpark users are used to higher-order functions, and getting a wrapped function doesn't help to process data internally.
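For concreteness, here is a minimal runnable sketch of the two styles quoted above. It assumes `applyInPandas` lands as proposed in this PR; `subtract_mean` and the toy data are made up for illustration:

```python
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import PandasUDFType, pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))


def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group; center `v` within it.
    return pdf.assign(v=pdf.v - pdf.v.mean())


# Proposed style: a plain function plus an explicit output schema.
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()

# Existing style: the same function, first wrapped into a pandas_udf object.
grouped = pandas_udf(subtract_mean, "id long, v double", PandasUDFType.GROUPED_MAP)
df.groupby("id").apply(grouped).show()

# The confusion from the second bullet: `grouped` is marked as a UDF and is
# callable on Columns, but actually using it like a UDF fails at analysis
# time, because grouped map UDFs only work inside groupby().apply():
# df.select(grouped(df.v))
```

Which is exactly my second point: `grouped` satisfies `UDFLike` on paper, yet cannot be used as a UDF in practice.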
