Github user rednaxelafx commented on the issue:
https://github.com/apache/spark/pull/20163
Thanks for all of your comments, @HyukjinKwon and @icexelloss !
I'd like to wait for more discussion / suggestions on whether we want a
behavior change that makes this reproducer work, or a simple documentation
change that just says PySpark doesn't support a mismatching `returnType`.
All of what you guys mentioned is correct. Sorry for the mess, I actually
got myself confused...
It's been a while since I first noticed the problem and came up with a
patch, and when I picked it back up I had obviously lost some context as to
what exactly failed in the first place...
Both @HyukjinKwon and @icexelloss correctly pointed out that the bug only
happens when the `udf()` creation declares a return type that doesn't match
what the function actually returns. In the reproducer, the declared UDF return
type is the default `string`, but the actual return values were
`datetime.date` / `datetime.datetime`. That took the code path that skips the
conversion in `pyspark/sql/types.py`, so the value went straight through to
Pyrolite and got unpickled as a `java.util.Calendar`.
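For context, here is a minimal sketch of the kind of mismatch being discussed
(not the exact reproducer from this PR; the UDF name and data are made up for
illustration):
```python
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# The return type is left at the default `string`, but the function
# actually returns a datetime.date -- the mismatch described above.
make_date = udf(lambda i: datetime.date(2018, 1, 5))

# The returned date skips the string conversion on the Python side and
# reaches the JVM as a pickled datetime.date, where Pyrolite unpickles
# it as a java.util.Calendar instead of the declared string.
spark.range(1).select(make_date("id").alias("d")).show()
```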
A note on how I got here: the reason my current PR (incorrectly) contained
the `case (Calendar, DateType)` cases and friends is that initially I only had
a reproducer for a Python UDF that actually returned `datetime.date` while
using the default return type (`string`), and I had fixed it by introducing a
case in `EvaluatePython.scala` that converts a `java.util.Calendar` to the
appropriate type. At that time that case was definitely executed and it did
give me correct results for that single reproducer. I did notice that the
cases where the UDF return type was correctly declared were working correctly.
But then I realized I also needed to tell `datetime.date` and
`datetime.datetime` apart, and I couldn't do that in a single
`case (Calendar, StringType)` in `EvaluatePython.scala` anymore, because at
that point I was lacking the type information from the Python side. So I went
back to the Python side and thought I needed to handle `datetime.date` /
`datetime.datetime` separately, but eventually both ended up being handled by
just a `str(value)` coercion in a band-aid fix in `udf.py`. The version that
@HyukjinKwon suggested above is indeed a proper version of that.
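Roughly, that band-aid amounted to something like the following sketch
(illustrative only, not the actual `udf.py` change; the helper names are made
up):
```python
import datetime

def _coerce_datetime_to_str(value):
    # Stringify date/datetime results so they match the declared `string`
    # return type; str() produces the same ISO-style shapes Spark SQL uses
    # by default for date/timestamp strings.
    if isinstance(value, (datetime.date, datetime.datetime)):
        return str(value)
    return value

def _wrap_string_udf(f):
    # Wrap the user function so its result is coerced before being
    # pickled and sent back to the JVM.
    def wrapped(*args):
        return _coerce_datetime_to_str(f(*args))
    return wrapped
```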
At this point the new code in `EvaluatePython.scala` and friends is dead
code.
To address a point from @icexelloss:

> The change that the PR proposes seem to be coercing python datetime.datetime and datetime.date to the python datetime string representation rather the java one.
The reason why I used `str()` there was that both Python and Spark SQL
follow the same default string formats for date (`yyyy-MM-dd`) and datetime
(`yyyy-MM-dd HH:mm:ss`), e.g.
```c
static PyObject *
date_isoformat(PyDateTime_Date *self)
{
    return PyUnicode_FromFormat("%04d-%02d-%02d",
                                GET_YEAR(self), GET_MONTH(self),
                                GET_DAY(self));
}
```
and
```scala
// `SimpleDateFormat` is not thread-safe.
private val threadLocalDateFormat = new ThreadLocal[DateFormat] {
  override def initialValue(): SimpleDateFormat = {
    new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  }
}
```
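The Python side of that equivalence is just the default `str()` formatting of
`date` / `datetime` (a quick check, purely illustrative):
```python
import datetime

# str() on date/datetime produces the same "yyyy-MM-dd" /
# "yyyy-MM-dd HH:mm:ss" shapes that Spark SQL uses by default
# (a datetime with non-zero microseconds also appends ".ffffff").
print(str(datetime.date(2018, 1, 5)))                  # 2018-01-05
print(str(datetime.datetime(2018, 1, 5, 12, 34, 56)))  # 2018-01-05 12:34:56
```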