[ 
https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819834#comment-17819834
 ] 

Dylan Walker commented on SPARK-47134:
--------------------------------------

[~bersprockets]

Hmm, it's possible I may have made too many assumptions.  I left out that this 
is on EMR, which does have its own fork of Spark.

If this is referring to names that don't exist in the Apache Spark codebase, 
this may be an Amazon thing.  I will reach out to AWS support to confirm, and 
apologies if this turns out to be the case.

> Unexpected nulls when casting decimal values in specific cases
> --------------------------------------------------------------
>
>                 Key: SPARK-47134
>                 URL: https://issues.apache.org/jira/browse/SPARK-47134
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Dylan Walker
>            Priority: Major
>         Attachments: 321queryplan.txt, 341queryplan.txt
>
>
> In specific cases, casting decimal values can result in `null` values where 
> no overflow exists.
> The cases appear very specific, and I don't have the depth of knowledge to 
> generalize this issue, so here is a simple spark-shell reproduction:
> *Setup:*
> {code:scala}
> scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", 
> x)).toDS
> ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
> scala> ds.createOrReplaceTempView("t")
> {code}
>  
> *Spark 3.2.1 behaviour (correct):*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +--------------------+
> |                  ct|
> +--------------------+
> | 9508.00000000000000|
> |13879.00000000000000|
> +--------------------+
> {code}
> *Spark 3.4.1 / Spark 3.5.0 behaviour:*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +-------------------+
> |                 ct|
> +-------------------+
> |               null|
> |9508.00000000000000|
> +-------------------+
> {code}
> This is fairly delicate:
>  - removing the {{ORDER BY}} clause produces the correct result
>  - removing the {{CAST}} produces the correct result
>  - changing the number of 0s in the argument to {{SUM}} produces the correct 
> result
>  - setting {{spark.ansi.enabled}} to {{true}} produces the correct result 
> (and does not throw an error)
> Also, removing the {{ORDER BY}}, but writing {{ds}} to a parquet will also 
> result in the unexpected nulls.
> Please let me know if you need additional information.
> We are also interested in understanding whether setting 
> {{spark.ansi.enabled}} can be considered a reliable workaround to this issue 
> prior to a fix being released, if possible.
> Text files that include {{explain()}} output attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to