Github user rednaxelafx commented on the issue:
https://github.com/apache/spark/pull/20163
Thanks for all of your comments, @HyukjinKwon and @icexelloss !
I'd like to wait for more discussion / suggestions on whether we want a
behavior change that makes this reproducer work, or a simple documentation
change that just says PySpark doesn't support a mismatching `returnType`.
All of what you guys mentioned is correct. Sorry for the mess, I actually
got myself confused...
It's been a while since I first noticed the problem and came up with a
patch, and when I picked it back up I had obviously lost some context as to
what exactly failed in the first place...
Both @HyukjinKwon and @icexelloss correctly pointed out that the bug only
happens when the `udf()` creation declares a return type that doesn't match
what the function actually returns. In the reproducer, the declared UDF return
type is the default `string`, but the actual return values were
`datetime.date` / `datetime.datetime`. That took the code path that skips the
conversion in `pyspark/sql/types.py`, so the value went straight through to
Pyrolite and got unpickled as a `java.util.Calendar`.
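For context, here is a minimal sketch of the kind of mismatch being discussed
(not the exact reproducer from this PR; the UDF name and data are made up for
illustration):
```python
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# The return type is left at the default `string`, but the function
# actually returns a datetime.date -- the mismatch described above.
make_date = udf(lambda i: datetime.date(2018, 1, 5))

# The returned date skips the string conversion on the Python side and
# reaches the JVM as a pickled datetime.date, where Pyrolite unpickles
# it as a java.util.Calendar instead of the declared string.
spark.range(1).select(make_date("id").alias("d")).show()
```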
A note on how I got here: the reason my current PR (incorrectly) contained
the `case (Calendar, DateType)` cases and friends is that initially I only had
a reproducer for a Python UDF that actually returned `datetime.date` while
using the default return type (`string`), and I had fixed it by introducing a
case in `EvaluatePython.scala` that converts a `java.util.Calendar` to the
appropriate type. At that time that case was definitely executed and it did
give me correct results for that single reproducer. I did notice that the
cases where the UDF return type was correctly declared were working correctly.
But then I realized I also needed to tell `datetime.date` and
`datetime.datetime` apart, and I couldn't do that in a single
`case (Calendar, StringType)` in `EvaluatePython.scala` anymore, because at
that point I was lacking the type information from the Python side. So I went
back to the Python side and thought I needed to handle `datetime.date` /
`datetime.datetime` separately, but eventually both ended up being handled by
just a `str(value)` coercion in a band-aid fix in `udf.py`. The version that
@HyukjinKwon suggested above is indeed a proper version of that.
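Roughly, that band-aid amounted to something like the following sketch
(illustrative only, not the actual `udf.py` change; the helper names are made
up):
```python
import datetime

def _coerce_datetime_to_str(value):
    # Stringify date/datetime results so they match the declared `string`
    # return type; str() produces the same ISO-style shapes Spark SQL uses
    # by default for date/timestamp strings.
    if isinstance(value, (datetime.date, datetime.datetime)):
        return str(value)
    return value

def _wrap_string_udf(f):
    # Wrap the user function so its result is coerced before being
    # pickled and sent back to the JVM.
    def wrapped(*args):
        return _coerce_datetime_to_str(f(*args))
    return wrapped
```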
At this point the new code in `EvaluatePython.scala` and friends is dead
code.
To address a point from @icexelloss:

> The change that the PR proposes seem to be coercing python datetime.datetime and datetime.date to the python datetime string representation rather the java one.
The reason why I used `str()` there was that both Python and Spark SQL
follow the same default string formats for date (`yyyy-MM-dd`) and datetime
(`yyyy-MM-dd HH:mm:ss`), e.g.
```c
static PyObject *
date_isoformat(PyDateTime_Date *self)
{
    return PyUnicode_FromFormat("%04d-%02d-%02d",
                                GET_YEAR(self), GET_MONTH(self),
                                GET_DAY(self));
}
```
and
```scala
// `SimpleDateFormat` is not thread-safe.
private val threadLocalDateFormat = new ThreadLocal[DateFormat] {
  override def initialValue(): SimpleDateFormat = {
    new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  }
}
```
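The Python side of that equivalence is just the default `str()` formatting of
`date` / `datetime` (a quick check, purely illustrative):
```python
import datetime

# str() on date/datetime produces the same "yyyy-MM-dd" /
# "yyyy-MM-dd HH:mm:ss" shapes that Spark SQL uses by default
# (a datetime with non-zero microseconds also appends ".ffffff").
print(str(datetime.date(2018, 1, 5)))                  # 2018-01-05
print(str(datetime.datetime(2018, 1, 5, 12, 34, 56)))  # 2018-01-05 12:34:56
```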