Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20163
The problem here seems to be that `returnType` is mismatched with the actual return value. In the case of
`DateType`, the value needs an explicit conversion into integers:
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175
which is called from `worker.py`:
https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74
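To make the flow concrete, here is a condensed, self-contained sketch of that conversion path. It paraphrases the linked `types.py` and `worker.py` code rather than quoting it exactly; the `DataType` stub and the class bodies are simplified:
```python
import datetime

class DataType(object):
    """Simplified stand-in for pyspark.sql.types.DataType (sketch only)."""
    def needConversion(self):
        return False

    def toInternal(self, obj):
        return obj

    def fromInternal(self, obj):
        return obj

class DateType(DataType):
    # Paraphrase of the linked types.py lines: a datetime.date is converted
    # to days since the Unix epoch before it is handed to the JVM.
    EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()

    def needConversion(self):
        return True

    def toInternal(self, d):
        if d is not None:
            return d.toordinal() - self.EPOCH_ORDINAL

    def fromInternal(self, v):
        if v is not None:
            return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)

# Paraphrase of the worker.py wrapping: the UDF result only goes through
# toInternal when the declared returnType says it needs conversion.
def wrap_udf(f, return_type):
    if return_type.needConversion():
        toInternal = return_type.toInternal
        return lambda *a: toInternal(f(*a))
    else:
        return lambda *a: f(*a)

# With DateType, the conversion kicks in and an integer is produced:
wrapped = wrap_udf(lambda _: datetime.date(2018, 1, 5), DateType())
print(wrapped(None))  # 17536 (days since 1970-01-01)
```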
If the `returnType` is `StringType`, it doesn't need the conversion
because, to my knowledge, Pyrolite and the serialization layer handle Python strings fine as they are:
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82
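Continuing the sketch above (again a simplification of the real class), `StringType` does not override `needConversion`, so the default no-op applies and `wrap_udf` skips the conversion entirely:
```python
class StringType(DataType):
    # No override of needConversion/toInternal: whatever the UDF returns
    # is handed to the serializer untouched.
    pass

wrapped = wrap_udf(lambda _: datetime.date(2018, 1, 5), StringType())
print(wrapped(None))  # datetime.date(2018, 1, 5) -- shipped as-is, not as a string
```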
So, here:
https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74
we send the return values as they are, without conversion, which ends up
mapping `datetime.date` -> `java.util.Calendar` as you described in the PR
description. Therefore, I don't think the current fix in `EvaluatePython.scala`
is reachable with the reproducer above.
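To be concrete about the kind of mismatched UDF that triggers this path, here is a hypothetical minimal reproducer (not necessarily the exact one from the PR; the UDF name is made up):
```python
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# The declared returnType says string, but the function actually returns a
# datetime.date. Since StringType needs no conversion, the worker pickles the
# raw date object and ships it to the JVM, where Pyrolite unpickles it as
# java.util.Calendar instead of a String.
date_as_string = udf(lambda _: datetime.date(2018, 1, 5), StringType())

spark.range(1).select(date_as_string("id")).show()
```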
As for the fix on the Python side in `udf.py`, that is a band-aid. To deal
with this problem correctly, I believe we should do something like:
```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):
     __metaclass__ = DataTypeSingleton
+    def needConversion(self):
+        return True
+
+    def toInternal(self, v):
+        if v is not None:
+            return str(v)
+
+    def fromInternal(self, v):
+        if v is not None:
+            return str(v)
+
```
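With a change along those lines, the mismatched UDF above would at least produce a sensible string. A rough illustration of the patched behaviour, reusing the sketch classes from earlier (`PatchedStringType` is a made-up name, not the real class):
```python
# A patched StringType would coerce the return value via str() in toInternal.
class PatchedStringType(DataType):
    def needConversion(self):
        return True

    def toInternal(self, v):
        if v is not None:
            return str(v)

wrapped = wrap_udf(lambda _: datetime.date(2018, 1, 5), PatchedStringType())
print(wrapped(None))  # '2018-01-05'
```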
but then this would introduce a performance regression because `str` would have
to be called on every value. For examples of how such an extra per-value
function call can hurt performance, see both https://github.com/apache/spark/pull/19246
and https://github.com/apache/spark/pull/19249.
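As a rough illustration of why I am wary of that per-value call (a toy micro-benchmark only; the numbers are not representative of an actual Spark workload):
```python
import timeit

values = ["value"] * 1000

# Baseline: pass the values through untouched (today's StringType path).
baseline = timeit.timeit(lambda: [v for v in values], number=1000)

# With the proposed change: an extra str() call per value in toInternal.
with_str = timeit.timeit(lambda: [str(v) for v in values], number=1000)

print("baseline: %.3fs, with str(): %.3fs" % (baseline, with_str))
```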
I am not sure this is something we should allow. Can we simply document
that `returnType` should be compatible with the actual return value?