[ https://issues.apache.org/jira/browse/SPARK-32612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178927#comment-17178927 ]

Robert Joseph Evans commented on SPARK-32612:
---------------------------------------------

This is just one example that shows what can happen. Yes, if I am getting 
overflows, an inconsistent floating-point cast is arguably not my biggest 
problem; I used it simply to make the difference in the results easier to see. 
There are other, similar problems that can show up but are much more subtle. A 
double has 52 explicit bits in the fraction part of the format (53 bits of 
effective precision with the implicit leading bit), so any long value that 
needs more than 53 bits of precision will produce different results if it is 
first cast to a 64-bit float (double) for processing.
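
As a rough plain-Python illustration (nothing Spark-specific, just the round 
trip through a 64-bit double):

{code}
>>> big = (1 << 60) + 1                       # needs more than 53 bits of precision
>>> big == int(float(big))                    # the low bits are lost in the double
False
>>> int(float((1 << 53) + 1)) == (1 << 53)    # 2**53 + 1 collapses to 2**53
True
{code}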

Is this a serious problem? I'm not totally sure, which is why I left it at the 
default priority. It violates the principle of least surprise: as a developer, 
I would not expect this to happen. Is it likely to be something that someone 
sees in production? I doubt it. The only reason I found this was because I was 
testing with random numbers to try to validate changes I was making as part 
of https://github.com/NVIDIA/spark-rapids. I mostly wanted to be sure that this 
was documented. I'll leave it up to the community to decide whether this is 
important enough to fix.

> int columns produce inconsistent results on pandas UDFs
> -------------------------------------------------------
>
>                 Key: SPARK-32612
>                 URL: https://issues.apache.org/jira/browse/SPARK-32612
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> This is similar to SPARK-30187 but I personally consider this data corruption.
> If I have a simple pandas UDF
> {code}
> >>> from pyspark.sql.functions import pandas_udf, col
> >>> from pyspark.sql.types import LongType, IntegerType, StructType, StructField
> >>> def add(a, b):
> ...     return a + b
> >>> my_udf = pandas_udf(add, returnType=LongType())
> {code}
> And I want to process some data with it, say 32-bit ints
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, 4)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
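> For reference, the -2052657052 above matches the 64-bit sum wrapped to a signed 
> 32-bit integer; a quick plain-Python sanity check of the arithmetic (nothing 
> Spark-specific):
> {code}
> >>> true_sum = (1037694399 - 3) + 1204615848
> >>> true_sum
> 2242310244
> >>> (true_sum + 2**31) % 2**32 - 2**31   # wrap to signed 32-bit
> -2052657052
> {code}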
> I get an integer overflow for the data, as I would expect. But as soon as I 
> add a {{None}} to the data, even on a different row, the result I get back is 
> totally different.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> +----------+----------+---------------+
> {code}
> The integer overflow disappears. This is because Arrow and/or pandas changes 
> the data type to a float in order to be able to store the null value, and 
> since the processing is then done in floating point, there is no overflow. 
> This in and of itself is annoying, but understandable, because it is dealing 
> with a limitation in pandas. 
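> As a rough illustration outside of Spark (plain pandas/pyarrow, not 
> necessarily the exact code path Spark takes), an int column is silently 
> promoted to float64 as soon as it has to hold a null:
> {code}
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> pd.Series([1037694396, 3], dtype="int32").dtype
> dtype('int32')
> >>> pd.Series([1037694396, None]).dtype
> dtype('float64')
> >>> pa.array([1037694396, None], type=pa.int32()).to_pandas().dtype
> dtype('float64')
> {code}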
> Where it becomes a bug is that this happens on a per-batch basis. This means 
> that I can have the same two rows in different parts of my data set and get 
> different results depending on their proximity to a null value.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None),
> ...                             (1037694399, 1204615848), (3, 4)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|     2242310244|
> |         3|         4|              4|
> +----------+----------+---------------+
> >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2')
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> Personally, I would prefer to have all nullable integer columns upgraded to 
> float all the time; that way it is at least consistent.



