skambha commented on pull request #29125:
URL: https://github.com/apache/spark/pull/29125#issuecomment-668179580


   @cloud-fan,
   In some overflow scenarios, this backport change by itself causes incorrect results to be returned to the user, whereas before this change the user would see an error.

   The test cases in DataFrameSuite cover these scenarios. Here is an example taken from there that I ran on Spark 3.0.1 with and without this change, so you can see the incorrect-result behavior.

   In other words, this backport on its own makes more scenarios return incorrect results to the user.
   
   1. With this backport change, **incorrect results:**
   ```
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
         /_/
   
   Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala>  val decStr = "1" + "0" * 19
   decStr: String = 10000000000000000000
   
   scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]
       
   scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
        |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
   d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]
      
   
   scala> d5.show(false)
   +---------------------------------------+
   |sumd                                   |
   +---------------------------------------+
   |20000000000000000000.000000000000000000|
   +---------------------------------------+
   
   ```
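
   For context on why the result above is incorrect: the union has 12 rows (1 from range(0, 1) and 11 from range(0, 11)), each carrying the constant 10^19, so the true sum is 120000000000000000000. That value needs 21 integral digits, i.e. precision 39 with scale 18, which does not fit in decimal(38,18). A plain-Scala sketch (no Spark involved) just to check the arithmetic:

   ```
   // Plain Scala arithmetic check for the 12 rows in the repro above.
   val decStr = "1" + "0" * 19                      // 10^19, as in the repro
   val rowCount = 1 + 11                            // range(0, 1) has 1 row, range(0, 11) has 11
   val trueSum = BigDecimal(decStr) * BigDecimal(rowCount)
   println(trueSum)                                 // 120000000000000000000
   // 21 integral digits + 18 fractional digits => precision 39 > 38,
   // so the sum does not fit in decimal(38,18) and 2 * 10^19 is not the right answer.
   println(trueSum.precision + 18)                  // 39
   ```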
   
   2. With this backport change, **incorrect results** with ANSI mode enabled as well:
   ```
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
         /_/
   
   Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> spark.conf.set("spark.sql.ansi.enabled","true")
   
   scala> val decStr = "1" + "0" * 19
   decStr: String = 10000000000000000000
   
   scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
        |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
   d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]
   
   scala> d5.show(false)
   +---------------------------------------+
   |sumd                                   |
   +---------------------------------------+
   |20000000000000000000.000000000000000000|
   +---------------------------------------+
   ```
   
   WITHOUT THIS CHANGE, the same test throws an error in both cases, with ANSI mode enabled and without.
   ```
   scala> val decStr = "1" + "0" * 19
   decStr: String = 10000000000000000000
   
   scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
        |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
   d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]
   
   scala> d5.show(false)
   20/08/03 11:15:05 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
   java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
           at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
           at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
           at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
           at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
   ```
   
   Without this change, the ANSI-enabled scenario also throws an error:
   ```
   scala> spark.conf.set("spark.sql.ansi.enabled","true")
   
   scala> val decStr = "1" + "0" * 19
   decStr: String = 10000000000000000000
   
   scala>  val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as d"),
        |   lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
   d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]
   
   scala> d5.show(false)
   20/08/03 11:18:08 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
   java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
           at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
           at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
           at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
           at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
           at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregate_sum_0$(generated.java:41)
   
   
   ```
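
   To make the expected behavior easy to check, here is a rough sketch of the kind of assertion the DataFrameSuite cases make (only a sketch runnable in spark-shell, not the actual test code): on a build without this backport, the action fails instead of returning a value.

   ```
   // Sketch only. The exact wrapping of the failure (SparkException vs.
   // ArithmeticException) can differ between the shell and a test suite,
   // so this just asserts that the job does not succeed.
   import scala.util.Try
   import org.apache.spark.sql.functions._
   import spark.implicits._

   val decStr = "1" + "0" * 19
   val df = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
     .select(expr(s"cast('$decStr' as decimal(38, 18)) as d"), lit(1).as("key"))
     .groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")

   val attempt = Try(df.collect())
   assert(attempt.isFailure)   // holds without the backport; with it, a (wrong) row comes back instead
   ```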

