Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22610#discussion_r223173637
--- Diff: python/pyspark/sql/functions.py ---
@@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
         can fail on special rows, the workaround is to incorporate the condition into the functions.
     .. note:: The user-defined functions do not take keyword arguments on the calling side.
+
+    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
+        matched with defined returnType. When there is mismatch between them, it is not guaranteed
+        that the conversion by SparkSQL during serialization is correct at all and users might get
--- End diff ---
> an attempt will be made to cast the data and results should be checked for accuracy.
That phrasing makes the casting sound intentional. As far as I can tell, the casting logic is not that clear, compared with the SQL casting logic. Can we leave it as "not guaranteed" for now and document the casting logic here instead? BTW, does Arrow have any documentation for its type conversions?
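
For context, here is a minimal sketch of the kind of mismatch being discussed. It assumes an active `SparkSession` named `spark` with Arrow enabled; the exact casting behavior is exactly what varies across Spark/Arrow versions:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Declared returnType is double, but the function returns an int64 Series.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(s):
    # s + 1 keeps pandas dtype int64, mismatching the declared double type;
    # whether/how this gets cast during serialization is the behavior the
    # note says is not guaranteed.
    return s + 1

df = spark.range(3)
df.select(plus_one(df["id"]).alias("plus_one")).show()
# int64 -> double happens to widen cleanly here, but other mismatches
# (e.g. returning strings for a numeric returnType) may fail or silently
# produce wrong values.
```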