Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20163
The problem here seems to be that `returnType` is mismatched with the actual return value. In the case of
`DateType`, the value needs an explicit conversion into integers:
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175
which is called from `worker.py`:
https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74
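To make the flow concrete, here is a condensed, self-contained sketch of that conversion path. It paraphrases the linked `types.py` and `worker.py` code rather than quoting it exactly; the `DataType` stub and the class bodies are simplified:
```python
import datetime

class DataType(object):
    """Simplified stand-in for pyspark.sql.types.DataType (sketch only)."""
    def needConversion(self):
        return False

    def toInternal(self, obj):
        return obj

    def fromInternal(self, obj):
        return obj

class DateType(DataType):
    # Paraphrase of the linked types.py lines: a datetime.date is converted
    # to days since the Unix epoch before it is handed to the JVM.
    EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()

    def needConversion(self):
        return True

    def toInternal(self, d):
        if d is not None:
            return d.toordinal() - self.EPOCH_ORDINAL

    def fromInternal(self, v):
        if v is not None:
            return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)

# Paraphrase of the worker.py wrapping: the UDF result only goes through
# toInternal when the declared returnType says it needs conversion.
def wrap_udf(f, return_type):
    if return_type.needConversion():
        toInternal = return_type.toInternal
        return lambda *a: toInternal(f(*a))
    else:
        return lambda *a: f(*a)

# With DateType, the conversion kicks in and an integer is produced:
wrapped = wrap_udf(lambda _: datetime.date(2018, 1, 5), DateType())
print(wrapped(None))  # 17536 (days since 1970-01-01)
```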
If the `returnType` is `StringType`, it doesn't need the conversion
because, to my knowledge, Pyrolite and the serialization layer handle Python strings fine as they are:
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82
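Continuing the sketch above (again a simplification of the real class), `StringType` does not override `needConversion`, so the default no-op applies and `wrap_udf` skips the conversion entirely:
```python
class StringType(DataType):
    # No override of needConversion/toInternal: whatever the UDF returns
    # is handed to the serializer untouched.
    pass

wrapped = wrap_udf(lambda _: datetime.date(2018, 1, 5), StringType())
print(wrapped(None))  # datetime.date(2018, 1, 5) -- shipped as-is, not as a string
```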
So, here:
https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74
we send the return values as they are, without conversion, which ends up
mapping `datetime.date` -> `java.util.Calendar` as you described in the PR
description. Therefore, I don't think the current fix in `EvaluatePython.scala`
is reachable with the reproducer above.
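To be concrete about the kind of mismatched UDF that triggers this path, here is a hypothetical minimal reproducer (not necessarily the exact one from the PR; the UDF name is made up):
```python
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# The declared returnType says string, but the function actually returns a
# datetime.date. Since StringType needs no conversion, the worker pickles the
# raw date object and ships it to the JVM, where Pyrolite unpickles it as
# java.util.Calendar instead of a String.
date_as_string = udf(lambda _: datetime.date(2018, 1, 5), StringType())

spark.range(1).select(date_as_string("id")).show()
```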
As for the fix on the Python side in `udf.py`, that is a band-aid. To deal
with this problem correctly, I believe we should do something like:
```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):
     __metaclass__ = DataTypeSingleton
+    def needConversion(self):
+        return True
+
+    def toInternal(self, v):
+        if v is not None:
+            return str(v)
+
+    def fromInternal(self, v):
+        if v is not None:
+            return str(v)
+
```
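With a change along those lines, the mismatched UDF above would at least produce a sensible string. A rough illustration of the patched behaviour, reusing the sketch classes from earlier (`PatchedStringType` is a made-up name, not the real class):
```python
# A patched StringType would coerce the return value via str() in toInternal.
class PatchedStringType(DataType):
    def needConversion(self):
        return True

    def toInternal(self, v):
        if v is not None:
            return str(v)

wrapped = wrap_udf(lambda _: datetime.date(2018, 1, 5), PatchedStringType())
print(wrapped(None))  # '2018-01-05'
```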
but then this would introduce a performance regression because `str` would have
to be called on every value. For examples of how such an extra per-value
function call can hurt performance, see both https://github.com/apache/spark/pull/19246
and https://github.com/apache/spark/pull/19249.
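As a rough illustration of why I am wary of that per-value call (a toy micro-benchmark only; the numbers are not representative of an actual Spark workload):
```python
import timeit

values = ["value"] * 1000

# Baseline: pass the values through untouched (today's StringType path).
baseline = timeit.timeit(lambda: [v for v in values], number=1000)

# With the proposed change: an extra str() call per value in toInternal.
with_str = timeit.timeit(lambda: [str(v) for v in values], number=1000)

print("baseline: %.3fs, with str(): %.3fs" % (baseline, with_str))
```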
I am not sure this is something we should allow. Can we simply document
that `returnType` should be compatible with the actual return value?