zero323 commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-574124028

@BryanCutler

> Using type hints can be a clear way to define the inputs/output of a UDF, but it might be problematic to be the only way a user can make a PandasUDF. What about if the user has a function they cannot modify to include type hints, or using lambda/partial functions? For these cases, it makes sense to use a wrapper like before, e.g. pandas_udf(func).

While I agree with the conclusion (though there is no immediate plan to deprecate the current style anyway) and had similar concerns myself, these concerns are mostly unjustified. Since the suggested changes don't depend on static analysis, the user is free to provide annotations at runtime:

```python
import pandas as pd

from inspect import signature
from pyspark.sql.pandas.typehints import infer_eval_type

def f(x: pd.Series, y: pd.Series) -> pd.Series:
    return x + y

infer_eval_type(signature(f))
## 200 (i.e. SQL_SCALAR_PANDAS_UDF)

def g(x, y):
    return x + y

infer_eval_type(signature(g))
## ValueError
## ...
## ValueError: Type hints for all parameters should be specified; however, got (x, y)

g.__annotations__ = {'x': pd.Series, 'y': pd.Series, 'return': pd.Series}

infer_eval_type(signature(g))
## 200
```

The same approach can be used with lambdas, and in-lined with a simple wrapper like this:

```python
def with_annotations(f, annotations):
    f.__annotations__ = annotations
    return f
```

though given the Pandas API design, lambdas are hardly useful.

As for partial functions: I am aware of more than one popular implementation, but if we're talking about `functools.partial`, `infer_eval_type` could easily be adjusted to eliminate `functools.partial.args` and `functools.partial.keywords` before proceeding.
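In fact, `inspect.signature` already resolves `functools.partial` objects by dropping parameters that were bound positionally, so little adjustment may be needed. A rough sketch of both cases (lambdas and partials), assuming the one-argument `infer_eval_type` used above and repeating `with_annotations` to keep it self-contained:

```python
import pandas as pd

from functools import partial
from inspect import signature

from pyspark.sql.pandas.typehints import infer_eval_type

def with_annotations(f, annotations):
    f.__annotations__ = annotations
    return f

# A lambda is a plain function object, so annotations can be attached after the fact:
add_one = with_annotations(lambda x: x + 1, {'x': pd.Series, 'return': pd.Series})

infer_eval_type(signature(add_one))
## 200

def h(x: pd.Series, y: pd.Series) -> pd.Series:
    return x + y

# inspect.signature drops the parameters bound positionally by functools.partial,
# leaving only the remaining, annotated parameters:
infer_eval_type(signature(partial(h, pd.Series([1]))))
## 200
```

Keyword-bound arguments are a different story (`inspect.signature` keeps them as parameters with defaults instead of dropping them), so those might still require explicit handling.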
> I'd also worry that forcing type hints might be off-putting to some users, since they are not that widely used or optional.

This proposal hardly forces usage of type hints (and, if anything, it makes usage of type hints harder). It should rather be seen as syntactic sugar, utilizing a long-established mechanism. Personally I don't have strong feelings about it (concerns raised in the JIRA ticket and design document notwithstanding), but it should be pointed out that the mechanisms used here are already utilized in the Python standard library, so they can hardly be considered particularly exotic.

> I'm also not too sure about changing some of the PandasUDFs to use regular functions, like with df.groupby.apply(udf).

As mentioned by @HyukjinKwon, I've agitated for this change, and I don't want to repeat all the points I've raised before. I think it is important to address these two points:

> I think it makes things less consistent by not using a pandas_udf for everything,

IMHO this is consistency from the perspective of a developer, not a user. It comes at a significant price (a growing number of execution types that the user has to be aware of) and doesn't make much sense if you take a look at the types (please refer to the design document for a more detailed argument). We do have methods with similar semantics in Scala ("strongly" typed `Dataset` transformations), in SparkR (already mentioned) and in third-party extensions (`sparklyr`'s `spark_apply`). None of them requires end users to be aware of the nitty-gritty details of the internal execution model. It is a bit like saying that `KeyValueGroupedDataset.mapGroups` should be used only as long as the user provides the corresponding `LogicalPlan` node.

From where I stand, forcing a udf (with all the related complexity) is more of a leaky abstraction than a consistent design.

> and it could be inconvenient for the user to keep specifying the schema as an argument for multiple calls,

For n-to-m, `pd.DataFrame` -> `pd.DataFrame` transformations, users are as likely to have a fixed schema as a schema varying based on the input Spark `DataFrame` (for example column additions and all-column aggregations). In such cases the alternative is to redefine the `udf` for each possible input. Arguably this is at least as inconvenient and, in many cases, hard to manage (as udf definitions have to be mixed with pipelines to be able to access the runtime schema); see the sketch below.
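To make that concrete, here is a minimal sketch using the plain-function-plus-schema style (via Spark 3.0's `applyInPandas`; the sample data and the `add_group_mean` helper are made up for illustration). The output schema is derived from the input `DataFrame` at call time instead of being baked into a udf definition:

```python
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

def add_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Appends a column, so the output schema is always
    # "input schema plus one double column", whatever the input is.
    return pdf.assign(mean_v=pdf["v"].mean())

# The schema argument is computed from the input at call time, so the same
# function can be reused with differently shaped inputs. With the
# pandas_udf(f, schema, GROUPED_MAP) style, the udf would have to be
# redefined for every distinct input schema instead.
schema = StructType(df.schema.fields + [StructField("mean_v", DoubleType())])

df.groupby("id").applyInPandas(add_group_mean, schema).show()
```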