GitHub user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r223173637
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType. When there is mismatch between them, it is not guaranteed
    +        that the conversion by SparkSQL during serialization is correct at all and users might get
    --- End diff --
    
    > an attempt will be made to cast the data and results should be checked for accuracy."
    
    It sounds like the casting is intentional, but as far as I can tell the casting logic is not that clear, compared with SQL's casting logic. Can we leave this as "not guaranteed" for now and document the casting logic here instead? BTW, does Arrow have some kind of documentation for type conversion?
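    
    For context, here is a minimal sketch of the mismatch being discussed (it assumes a running SparkSession bound to `spark` with PyArrow installed; `plus_half` is just an illustrative name, not part of this PR): the UDF declares `long` as returnType while the returned `pandas.Series` holds float64 values, so the result depends entirely on whatever cast happens during Arrow serialization:
    
        from pyspark.sql.functions import pandas_udf, PandasUDFType
    
        @pandas_udf('long', PandasUDFType.SCALAR)
        def plus_half(v):
            # Declared returnType is 'long', but this returns a float64 Series.
            return v + 0.5
    
        df = spark.range(3)  # column 'id' of LongType
        df.select(plus_half(df['id'])).show()
        # Depending on the Spark/Arrow versions, the fractional values may be
        # silently truncated (e.g. 0.5 -> 0) or an error may be raised.
    
    Whether such a cast is intentional, and what its exact semantics are, is the part that seems worth documenting.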

