zero323 commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type 
hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-574335425
 
 
   > @zero323 Maybe I'm thinking of this from the user's perspective, but to me `pandas_udf` in this context is really just a function bound with an output schema. So it seems like we are just talking about
   > 
   > ```python
   > df.groupby("...").applyInPandas(func, schema=df.schema)
   > ```
   > 
   > vs
   > 
   > ```python
   > df.groupby("...").apply(pandas_udf(func, returnType=df.schema))
   > ```
   
   Or some variant of that, where schema is derived from `df.schema`.
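   
   A hedged sketch of one such variant ‒ the `score` column and `func` here are hypothetical, only there to show the derivation:
   
   ```python
   from pyspark.sql.types import DoubleType, StructField, StructType
   
   # output schema: the input schema plus one derived column
   out_schema = StructType(df.schema.fields + [StructField("score", DoubleType())])
   
   df.groupby("...").applyInPandas(func, schema=out_schema)
   ```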
   
   > 
   > Which is just a subtle difference, so I'm not sure how this requires the user to know the details of internal execution.
   
   From where I stand, things look like this:
   
   - As long as hints are not obligatory (and, as mentioned in the design doc, types alone might not be sufficient to encode cardinality ‒ see the first sketch after this list), the latter will also require a `functionType`. This also affects which function types have to be exposed in the public API.
   
   - This also informs the user that we internally use a UDF for execution, and requires the creation of an object that is not suitable for use as a UDF if we stick to the public API (see the second sketch after this list).
   
      And without looking at the function type (which alone is not friendly for interactive introspection), it is not exactly obvious that an object marked as a UDF, and implementing the UDF interface:
   
        ```python
        from typing import Protocol
        from pyspark.sql import Column

        class UDFLike(Protocol):
            def __call__(self, *__args: Column) -> Column: ...
        ```
       
        is not intended to be used as a UDF at all. Looks rather confusing to me, but I admit I am rather used to looking at the PySpark API in terms of types.
   
   - Last but not least, the `pandas_udf` documentation is already hard to digest ‒ ~400 lines and growing.
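   
   To illustrate the first point, a minimal sketch (assuming the struct-typed scalar variant mentioned in the design doc): both functions below carry identical annotations, yet the first has to preserve input length while the second may return any number of rows, so the hints alone cannot pick the function type:
   
   ```python
   import pandas as pd
   
   def standardize(pdf: pd.DataFrame) -> pd.DataFrame:
       # scalar-style contract: output must have exactly len(pdf) rows
       return (pdf - pdf.mean()) / pdf.std()
   
   def top_rows(pdf: pd.DataFrame) -> pd.DataFrame:
       # grouped-map-style contract: output cardinality is arbitrary
       return pdf.head(3)
   ```
   
   And for the second point, a sketch against the current 2.4-style API (the exact exception differs between versions, but the `select` call fails either way):
   
   ```python
   from pyspark.sql.functions import pandas_udf, PandasUDFType
   
   @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
   def subtract_mean(pdf):
       return pdf.assign(v=pdf.v - pdf.v.mean())
   
   df.groupby("id").apply(subtract_mean)  # intended usage
   df.select(subtract_mean("id", "v"))    # type-checks as UDFLike, but fails at analysis
   ```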
   
   The first and the last point affect anyone who uses `pandas_udf` (and the first one seems to drive the discussion about API changes, so taking away some of the external complexity is good), even if the user doesn't use these particular variants of UDF. I honestly cannot see any good reason for that ‒ PySpark users are used to higher-order functions, and getting a wrapped function doesn't help to process data internally.
   
   
   
   
