skambha commented on pull request #29125:
URL: https://github.com/apache/spark/pull/29125#issuecomment-668179580
@cloud-fan,
In some overflow scenarios, this backport change on its own causes incorrect results to be returned to the user, whereas before the change the user would see an error. In other words, the backport by itself widens the set of scenarios that silently return wrong results.
The test cases in DataFrameSuite demonstrate these scenarios. Here is an example taken from there that I ran on Spark 3.0.1 with and without this change, where you can see the incorrect-result behavior: the query sums twelve rows of 10^19 cast to decimal(38, 18), and the correct sum of 1.2 * 10^20 does not fit in decimal(38, 18).
1. With this backport change, **incorrect results:**
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
+---------------------------------------+
|sumd                                   |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+
```
2. With this backport change, **incorrect results** with ANSI mode enabled as well.
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.ansi.enabled","true")

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
+---------------------------------------+
|sumd                                   |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+
```
Without this change, the same test throws an error in both cases, with ANSI mode enabled and without.
```
scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
20/08/03 11:15:05 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
  at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
  at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
  at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
```
Without this change, the ANSI-enabled scenario also throws an error.
```
scala> spark.conf.set("spark.sql.ansi.enabled","true")

scala> val decStr = "1" + "0" * 19
decStr: String = 10000000000000000000

scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
     |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

scala> d5.show(false)
20/08/03 11:18:08 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
  at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
  at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
  at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregate_sum_0$(generated.java:41)
```
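For reference, here is a minimal sketch of the same reproduction as a self-contained program rather than a spark-shell session; the local SparkSession setup and the object name are my own and not taken from DataFrameSuite, and the observed outcome (exception vs. incorrect value) depends on whether this backport is applied and on the ANSI setting.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DecimalSumOverflowRepro {
  def main(args: Array[String]): Unit = {
    // Local session just for the reproduction; any 3.0.x build can be used.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("decimal-sum-overflow-repro")
      .getOrCreate()
    import spark.implicits._

    // Optionally exercise the ANSI path as well.
    // spark.conf.set("spark.sql.ansi.enabled", "true")

    val decStr = "1" + "0" * 19 // 10^19
    val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)) // 12 rows
    val d5 = d3.select(expr(s"cast('$decStr' as decimal(38, 18)) as d"), lit(1).as("key"))
      .groupBy("key")
      .agg(sum($"d").alias("sumd"))
      .select($"sumd")

    // The true sum is 12 * 10^19 = 1.2 * 10^20, which does not fit in decimal(38, 18).
    // Without the backport this throws ArithmeticException ("Decimal precision 39
    // exceeds max precision 38"); with it, an incorrect value is returned instead.
    d5.show(false)

    spark.stop()
  }
}
```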