This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 43822cd  [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader
43822cd is described below

commit 43822cdd228a3ba49c47637c525d731d00772f64
Author: Andy Grove <andygrov...@gmail.com>
AuthorDate: Tue Feb 22 08:42:47 2022 -0600

    [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader

Signed-off-by: Andy Grove <andygrove73@gmail.com>

### What changes were proposed in this pull request?

When parsing JSON, unquoted `NaN` and `Infinity` values for floating-point columns yield the expected behavior shown below: valid values are returned when the parsing option `allowNonNumericNumbers` is enabled, and `null` otherwise.

| Value     | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
| --------- | --------------------------- | ---------------------------- |
| NaN       | Double.NaN                  | null                         |
| +INF      | Double.PositiveInfinity     | null                         |
| +Infinity | Double.PositiveInfinity     | null                         |
| Infinity  | Double.PositiveInfinity     | null                         |
| -INF      | Double.NegativeInfinity     | null                         |
| -Infinity | Double.NegativeInfinity     | null                         |

However, when these values are quoted, we get the following unexpected behavior, because a different code path is used that is inconsistent with Jackson's parsing and ignores the `allowNonNumericNumbers` parser option.

| Value       | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
| ----------- | --------------------------- | ---------------------------- |
| "NaN"       | Double.NaN                  | Double.NaN                   |
| "+INF"      | null                        | null                         |
| "+Infinity" | null                        | null                         |
| "Infinity"  | Double.PositiveInfinity     | Double.PositiveInfinity      |
| "-INF"      | null                        | null                         |
| "-Infinity" | Double.NegativeInfinity     | Double.NegativeInfinity      |

This PR updates the code path that handles quoted non-numeric numbers to make it consistent with the path that handles unquoted values.

### Why are the changes needed?
The current behavior does not match the documented behavior in https://spark.apache.org/docs/latest/sql-data-sources-json.html

### Does this PR introduce _any_ user-facing change?

Yes, parsing of quoted `NaN` and `Infinity` values will now be consistent with the unquoted versions.

### How was this patch tested?

Unit tests are updated.

Closes #35573 from andygrove/SPARK-38060.

Authored-by: Andy Grove <andygrov...@gmail.com>
Signed-off-by: Sean Owen <sro...@gmail.com>
---
 docs/core-migration-guide.md                       |  2 ++
 .../spark/sql/catalyst/json/JacksonParser.scala    | 18 ++++++----
 .../datasources/json/JsonParsingOptionsSuite.scala | 39 ++++++++++++++++++++++
 .../sql/execution/datasources/json/JsonSuite.scala |  6 ++++
 4 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md
index 745b80d..588433c 100644
--- a/docs/core-migration-guide.md
+++ b/docs/core-migration-guide.md
@@ -26,6 +26,8 @@ license: |
 
 - Since Spark 3.3, Spark migrates its log4j dependency from 1.x to 2.x because log4j 1.x has reached end of life and is no longer supported by the community. Vulnerabilities reported after August 2015 against log4j 1.x were not checked and will not be fixed. Users should rewrite original log4j properties files using log4j2 syntax (XML, JSON, YAML, or properties format). Spark rewrites the `conf/log4j.properties.template` which is included in Spark distribution, to `conf/log4j2.properties [...]
 
+- Since Spark 3.3, when reading values from a JSON attribute defined as `FloatType` or `DoubleType`, the strings `"+Infinity"`, `"+INF"`, and `"-INF"` are now parsed to the appropriate values, in addition to the already supported `"Infinity"` and `"-Infinity"` variations. This change was made to improve consistency with Jackson's parsing of the unquoted versions of these values. Also, the `allowNonNumericNumbers` option is now respected so these strings will now be considered invalid if [...]
+
 ## Upgrading from Core 3.1 to 3.2
 
 - Since Spark 3.2, `spark.scheduler.allocation.file` supports read remote file using hadoop filesystem which means if the path has no scheme Spark will respect hadoop configuration to read it. To restore the behavior before Spark 3.2, you can specify the local scheme for `spark.scheduler.allocation.file` e.g. `file:///path/to/file`.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index a1f9487..abcbdb8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -204,9 +204,12 @@ class JacksonParser(
         case VALUE_STRING if parser.getTextLength >= 1 =>
           // Special case handling for NaN and Infinity.
           parser.getText match {
-            case "NaN" => Float.NaN
-            case "Infinity" => Float.PositiveInfinity
-            case "-Infinity" => Float.NegativeInfinity
+            case "NaN" if options.allowNonNumericNumbers =>
+              Float.NaN
+            case "+INF" | "+Infinity" | "Infinity" if options.allowNonNumericNumbers =>
+              Float.PositiveInfinity
+            case "-INF" | "-Infinity" if options.allowNonNumericNumbers =>
+              Float.NegativeInfinity
             case _ => throw QueryExecutionErrors.cannotParseStringAsDataTypeError(
               parser, VALUE_STRING, FloatType)
           }
@@ -220,9 +223,12 @@ class JacksonParser(
         case VALUE_STRING if parser.getTextLength >= 1 =>
           // Special case handling for NaN and Infinity.
           parser.getText match {
-            case "NaN" => Double.NaN
-            case "Infinity" => Double.PositiveInfinity
-            case "-Infinity" => Double.NegativeInfinity
+            case "NaN" if options.allowNonNumericNumbers =>
+              Double.NaN
+            case "+INF" | "+Infinity" | "Infinity" if options.allowNonNumericNumbers =>
+              Double.PositiveInfinity
+            case "-INF" | "-Infinity" if options.allowNonNumericNumbers =>
+              Double.NegativeInfinity
             case _ => throw QueryExecutionErrors.cannotParseStringAsDataTypeError(
               parser, VALUE_STRING, DoubleType)
           }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
index e9fe79a..703085d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
@@ -130,6 +130,45 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSparkSession {
         Double.NegativeInfinity, Double.NegativeInfinity))
   }
 
+  test("allowNonNumericNumbers on - quoted") {
+    val str =
+      """{"c0":"NaN", "c1":"+INF", "c2":"+Infinity", "c3":"Infinity", "c4":"-INF",
+        |"c5":"-Infinity"}""".stripMargin
+    val df = spark.read
+      .schema(new StructType()
+        .add("c0", "double")
+        .add("c1", "double")
+        .add("c2", "double")
+        .add("c3", "double")
+        .add("c4", "double")
+        .add("c5", "double"))
+      .option("allowNonNumericNumbers", true).json(Seq(str).toDS())
+    checkAnswer(
+      df,
+      Row(
+        Double.NaN,
+        Double.PositiveInfinity, Double.PositiveInfinity, Double.PositiveInfinity,
+        Double.NegativeInfinity, Double.NegativeInfinity))
+  }
+
+  test("allowNonNumericNumbers off - quoted") {
+    val str =
+      """{"c0":"NaN", "c1":"+INF", "c2":"+Infinity", "c3":"Infinity", "c4":"-INF",
+        |"c5":"-Infinity"}""".stripMargin
+    val df = spark.read
+      .schema(new StructType()
+        .add("c0", "double")
+        .add("c1", "double")
+        .add("c2", "double")
+        .add("c3", "double")
+        .add("c4", "double")
+        .add("c5", "double"))
+      .option("allowNonNumericNumbers", false).json(Seq(str).toDS())
+    checkAnswer(
+      df,
+      Row(null, null, null, null, null, null))
+  }
+
   test("allowBackslashEscapingAnyCharacter off") {
     val str = """{"name": "Cazen Lee", "price": "\$10"}"""
     val df = spark.read.option("allowBackslashEscapingAnyCharacter", "false").json(Seq(str).toDS())
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
index 3daad30..bd01975 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@@ -2021,13 +2021,19 @@ abstract class JsonSuite
   test("SPARK-18772: Parse special floats correctly") {
     val jsons = Seq(
       """{"a": "NaN"}""",
+      """{"a": "+INF"}""",
+      """{"a": "-INF"}""",
       """{"a": "Infinity"}""",
+      """{"a": "+Infinity"}""",
       """{"a": "-Infinity"}""")
 
     // positive cases
     val checks: Seq[Double => Boolean] = Seq(
       _.isNaN,
       _.isPosInfinity,
+      _.isNegInfinity,
+      _.isPosInfinity,
+      _.isPosInfinity,
       _.isNegInfinity)
 
     Seq(FloatType, DoubleType).foreach { dt =>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
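[Editor's note] The guarded match arms this commit adds to JacksonParser.scala can be mirrored in a small standalone sketch, useful for checking the intended behavior in both tables of the PR description without a Spark build. This is plain Scala with no Spark dependency; `parseSpecialDouble` is a hypothetical helper name, not a Spark API.

```scala
// Standalone sketch of the quoted-literal handling added by this patch.
// Returns None where JacksonParser throws cannotParseStringAsDataTypeError,
// which the default PERMISSIVE mode then surfaces as a null column value.
def parseSpecialDouble(text: String, allowNonNumericNumbers: Boolean): Option[Double] =
  text match {
    case "NaN" if allowNonNumericNumbers =>
      Some(Double.NaN)
    case "+INF" | "+Infinity" | "Infinity" if allowNonNumericNumbers =>
      Some(Double.PositiveInfinity)
    case "-INF" | "-Infinity" if allowNonNumericNumbers =>
      Some(Double.NegativeInfinity)
    case _ =>
      // Any other string (or any of the above with the option disabled)
      // is treated as an invalid floating-point value.
      None
  }
```

Explicit arms for every spelling are needed here because `java.lang.Double.parseDouble` accepts the `Infinity`-style spellings and `NaN` but rejects `+INF`/`-INF`, while Jackson accepts all of them for unquoted values when `ALLOW_NON_NUMERIC_NUMBERS` is enabled.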