[ 
https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Walker updated SPARK-47134:
---------------------------------
    Description: 
In specific cases, casting decimal values can result in `null` values where no 
overflow exists.

The cases appear very specific, and I don't have the depth of knowledge to 
generalize this issue, so here is a simple spark-shell reproduction:

*Setup:*

{code:scala}
scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.createOrReplaceTempView("t")
{code}
 
*Spark 3.2.1 behaviour (correct):*

{code:scala}
scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
+--------------------+
|                  ct|
+--------------------+
| 9508.00000000000000|
|13879.00000000000000|
+--------------------+
{code}

*Spark 3.4.1 / Spark 3.5.0 behaviour:*

{code:scala}
scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
+-------------------+
|                 ct|
+-------------------+
|               null|
|9508.00000000000000|
+-------------------+
{code}

This is fairly delicate:
 - removing the `ORDER BY` clause produces the correct result (see the sketch after this list)
 - removing the `CAST` produces the correct result
 - changing the number of 0s in the argument to `SUM` produces the correct result
 - setting `spark.sql.ansi.enabled` to `true` produces the correct result (and does not throw an error)
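
To make the first bullet concrete, this is the variation with the `ORDER BY` removed, run in the same spark-shell session as the setup above; it returns the expected 9508/13879 sums:

{code:scala}
scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1`").show()
{code}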

Also, even with the `ORDER BY` removed, writing `ds` out to Parquet and running the query against the Parquet data still results in the unexpected nulls (rough sketch below).
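
A rough sketch of that Parquet variation (the path is only illustrative; the point is that the query reads from Parquet rather than the in-memory Dataset):

{code:scala}
// Illustrative path -- any location reproduces the behaviour described above
scala> ds.write.mode("overwrite").parquet("/tmp/spark47134_repro")

scala> spark.read.parquet("/tmp/spark47134_repro").createOrReplaceTempView("t")

// No ORDER BY, but the unexpected null still appears
scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1`").show()
{code}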

Please let me know if you need additional information.

We would also like to understand whether setting `spark.sql.ansi.enabled` to `true` (sketched below) can be considered a reliable workaround for this issue until a fix is released.
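
For reference, the workaround we are evaluating is simply enabling ANSI mode for the session before running the query:

{code:scala}
// With ANSI mode enabled, the query above returns the correct result and does not throw
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
{code}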
 

> Unexpected nulls when casting decimal values in specific cases
> --------------------------------------------------------------
>
>                 Key: SPARK-47134
>                 URL: https://issues.apache.org/jira/browse/SPARK-47134
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Dylan Walker
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
