[spark] branch master updated: [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS STRING) under ANSI SQL mode

gurwls223 Mon, 18 Jul 2022 20:15:12 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 5bd47640dc4 [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of 
CAST(DECIMAL AS STRING) under ANSI SQL mode
5bd47640dc4 is described below

commit 5bd47640dc455ae27c94825743a127bbb20e59bf
Author: Gengliang Wang <gengli...@apache.org>
AuthorDate: Tue Jul 19 12:14:43 2022 +0900

    [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS 
STRING) under ANSI SQL mode
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up of https://github.com/apache/spark/pull/37160, which 
changes the default behavior of `CAST(DECIMAL AS STRING)` as always using the 
plain string representation for ANSI SQL compliance.
    This PR is to move the new behavior under ANSI SQL mode. The default 
behavior remains unchanged.
    ### Why are the changes needed?
    
    After offline discussions,  changing the default behavior brings upgrade 
risks to Spark 3.4.0. AFAIK, there are existing tables storing the results of 
casting decimal as strings. Running aggregation or joins over these tables can 
produce incorrect results.
    Since the motivation for the new behaviors is ANSI SQL compliance, the new 
behavior only exists in the ANSI SQL mode makes more sense.
    Users who enable the ANSI SQL mode are supposed to accept such a breaking 
change.
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, though not released yet, this PR makes the behavior of casting decimal 
values to string more reasonable by always using plain string under ANSI SQL 
mode.
    ### How was this patch tested?
    
    UT
    
    Closes #37221 from gengliangwang/ansiDecimalToString.
    
    Authored-by: Gengliang Wang <gengli...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 docs/sql-migration-guide.md                                   |  1 -
 docs/sql-ref-ansi-compliance.md                               |  5 +++--
 .../org/apache/spark/sql/catalyst/expressions/Cast.scala      | 11 ++++++++---
 .../main/scala/org/apache/spark/sql/internal/SQLConf.scala    | 11 -----------
 .../apache/spark/sql/catalyst/expressions/CastSuiteBase.scala |  8 --------
 .../spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala |  5 +++++
 .../spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala  |  5 +++++
 7 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 0d83ee3aeb3..75f7f6c9f8c 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -26,7 +26,6 @@ license: |
   
   - Since Spark 3.4, Number or Number(\*) from Teradata will be treated as 
Decimal(38,18). In Spark 3.3 or earlier, Number or Number(\*) from Teradata 
will be treated as Decimal(38, 0), in which case the fractional part will be 
removed.
   - Since Spark 3.4, v1 database, table, permanent view and function 
identifier will include 'spark_catalog' as the catalog name if database is 
defined, e.g. a table identifier will be: `spark_catalog.default.t`. To restore 
the legacy behavior, set `spark.sql.legacy.v1IdentifierNoCatalog` to `true`.
-  - Since Spark 3.4, the results of casting Decimal values as String type will 
not contain exponential notations. To restore the legacy behavior, which uses 
scientific notation if the adjusted exponent is less than -6, set 
`spark.sql.legacy.castDecimalToString.enabled` to `true`.
 
 ## Upgrading from Spark SQL 3.2 to 3.3
 
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 6ad8210ed7e..65ed5caf833 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -81,7 +81,7 @@ Besides, the ANSI SQL mode disallows the following type 
conversions which are al
 
 | Source\Target | Numeric | String | Date | Timestamp | Interval | Boolean | 
Binary | Array | Map | Struct |
 
|-----------|---------|--------|------|-----------|----------|---------|--------|-------|-----|--------|
-| Numeric   | <span style="color:red">**Y**</span> | Y      | N    | <span 
style="color:red">**Y**</span>         | N      | Y       | N      | N     | N  
 | N      |
+| Numeric   | <span style="color:red">**Y**</span> | <span 
style="color:red">**Y**</span>      | N    | <span 
style="color:red">**Y**</span>         | N      | Y       | N      | N     | N  
 | N      |
 | String    | <span style="color:red">**Y**</span> | Y | <span 
style="color:red">**Y**</span> | <span style="color:red">**Y**</span> | <span 
style="color:red">**Y**</span> | <span style="color:red">**Y**</span> | Y | N   
  | N   | N      |
 | Date      | N       | Y      | Y    | Y         | N        | N       | N     
 | N     | N   | N      |
 | Timestamp | <span style="color:red">**Y**</span> | Y      | Y    | Y         
| N        | N       | N      | N     | N   | N      |
@@ -92,7 +92,7 @@ Besides, the ANSI SQL mode disallows the following type 
conversions which are al
 | Map       | N       | Y      | N    | N         | N        | N       | N     
 | N     | <span style="color:red">**Y**</span> | N      |
 | Struct    | N       | Y      | N    | N         | N        | N       | N     
 | N     | N   | <span style="color:red">**Y**</span> |
 
-In the table above, all the `CAST`s that can cause runtime exceptions are 
marked as red <span style="color:red">**Y**</span>:
+In the table above, all the `CAST`s with new syntax are marked as red <span 
style="color:red">**Y**</span>:
 * CAST(Numeric AS Numeric): raise an overflow exception if the value is out of 
the target data type's range.
 * CAST(String AS (Numeric/Date/Timestamp/Interval/Boolean)): raise a runtime 
exception if the value can't be parsed as the target data type.
 * CAST(Timestamp AS Numeric): raise an overflow exception if the number of 
seconds since epoch is out of the target data type's range.
@@ -100,6 +100,7 @@ In the table above, all the `CAST`s that can cause runtime 
exceptions are marked
 * CAST(Array AS Array): raise an exception if there is any on the conversion 
of the elements.
 * CAST(Map AS Map): raise an exception if there is any on the conversion of 
the keys and the values.
 * CAST(Struct AS Struct): raise an exception if there is any on the conversion 
of the struct fields.
+* CAST(Numeric AS String): Always use plain string representation on casting 
decimal values to strings, instead of using scientific notation if an exponent 
is needed
 
 ```sql
 -- Examples of explicit casting
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
index 5dd986e25e7..8afd3c13461 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -512,7 +512,6 @@ case class Cast(
     TimestampFormatter.getFractionFormatter(ZoneOffset.UTC)
 
   private val legacyCastToStr = 
SQLConf.get.getConf(SQLConf.LEGACY_COMPLEX_TYPES_TO_STRING)
-  private val legacyCastDecimalToStr = 
SQLConf.get.getConf(SQLConf.LEGACY_DECIMAL_TO_STRING)
   // The brackets that are used in casting structs and maps to strings
   private val (leftBracket, rightBracket) = if (legacyCastToStr) ("[", "]") 
else ("{", "}")
 
@@ -626,7 +625,10 @@ case class Cast(
     case DayTimeIntervalType(startField, endField) =>
       buildCast[Long](_, i => UTF8String.fromString(
         IntervalUtils.toDayTimeIntervalString(i, ANSI_STYLE, startField, 
endField)))
-    case _: DecimalType if !legacyCastDecimalToStr =>
+    // In ANSI mode, Spark always use plain string representation on casting 
Decimal values
+    // as strings. Otherwise, the casting is using `BigDecimal.toString` which 
may use scientific
+    // notation if an exponent is needed.
+    case _: DecimalType if ansiEnabled =>
       buildCast[Decimal](_, d => UTF8String.fromString(d.toPlainString))
     case _ => buildCast[Any](_, o => UTF8String.fromString(o.toString))
   }
@@ -1478,7 +1480,10 @@ case class Cast(
             $evPrim = UTF8String.fromString($iu.toDayTimeIntervalString($c, 
$style,
               (byte)${i.startField}, (byte)${i.endField}));
           """
-      case _: DecimalType if !legacyCastDecimalToStr =>
+      // In ANSI mode, Spark always use plain string representation on casting 
Decimal values
+      // as strings. Otherwise, the casting is using `BigDecimal.toString` 
which may use scientific
+      // notation if an exponent is needed.
+      case _: DecimalType if ansiEnabled =>
         (c, evPrim, _) => code"$evPrim = 
UTF8String.fromString($c.toPlainString());"
       case _ =>
         (c, evPrim, evNull) => code"$evPrim = 
UTF8String.fromString(String.valueOf($c));"
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 15bd5cb5e36..1b7857ead59 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3697,17 +3697,6 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
-  val LEGACY_DECIMAL_TO_STRING =
-    buildConf("spark.sql.legacy.castDecimalToString.enabled")
-      .internal()
-      .doc("When true, casting decimal values as string will use scientific 
notation if an " +
-        "exponent is needed, which is the same with the method 
java.math.BigDecimal.toString(). " +
-        "Otherwise, the casting result won't contain an exponent field, which 
is compliant to " +
-        "the ANSI SQL standard.")
-      .version("3.4.0")
-      .booleanConf
-      .createWithDefault(false)
-
   val LEGACY_PATH_OPTION_BEHAVIOR =
     buildConf("spark.sql.legacy.pathOptionBehavior.enabled")
       .internal()
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
index da9a7dca9f1..97cbc781829 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala
@@ -1305,12 +1305,4 @@ abstract class CastSuiteBase extends SparkFunSuite with 
ExpressionEvalHelper {
         Cast(child, DecimalType.USER_DEFAULT), it)
     }
   }
-
-  test("SPARK-39749: cast Decimal to string") {
-    val input = Literal.create(Decimal(0.000000123), DecimalType(9, 9))
-    checkEvaluation(cast(input, StringType), "0.000000123")
-    withSQLConf(SQLConf.LEGACY_DECIMAL_TO_STRING.key -> "true") {
-      checkEvaluation(cast(input, StringType), "1.23E-7")
-    }
-  }
 }
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala
index 56e586da2a3..a832b8ae3f5 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOffSuite.scala
@@ -603,6 +603,11 @@ class CastWithAnsiOffSuite extends CastSuiteBase {
     checkEvaluation(cast(Literal("2015-03-18T"), TimestampType), null)
   }
 
+  test("SPARK-39749: cast Decimal to string") {
+    val input = Literal.create(Decimal(0.000000123), DecimalType(9, 9))
+    checkEvaluation(cast(input, StringType), "1.23E-7")
+  }
+
   private def castOverflowErrMsg(targetType: DataType): String = {
     s"""cannot be cast to "${targetType.sql}" due to an overflow."""
   }
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala
index f2cfc529984..89f68656507 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala
@@ -576,4 +576,9 @@ class CastWithAnsiOnSuite extends CastSuiteBase with 
QueryErrorsBase {
         castErrMsg(invalidInput, TimestampNTZType))
     }
   }
+
+  test("SPARK-39749: cast Decimal to string") {
+    val input = Literal.create(Decimal(0.000000123), DecimalType(9, 9))
+    checkEvaluation(cast(input, StringType), "0.000000123")
+  }
 }


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch master updated: [SPARK-39749][SQL][FOLLOWUP] Move the new behavior of CAST(DECIMAL AS STRING) under ANSI SQL mode

Reply via email to