This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 43822cd  [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader
43822cd is described below

commit 43822cdd228a3ba49c47637c525d731d00772f64
Author: Andy Grove <andygrov...@gmail.com>
AuthorDate: Tue Feb 22 08:42:47 2022 -0600

    [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader

Signed-off-by: Andy Grove <andygrove73@gmail.com>

### What changes were proposed in this pull request?

When parsing JSON, unquoted `NaN` and `Infinity` values for floating-point columns yield the expected behavior shown below: valid values are returned when the parsing option `allowNonNumericNumbers` is enabled, and `null` otherwise.

| Value     | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
| --------- | --------------------------- | ---------------------------- |
| NaN       | Double.NaN                  | null                         |
| +INF      | Double.PositiveInfinity     | null                         |
| +Infinity | Double.PositiveInfinity     | null                         |
| Infinity  | Double.PositiveInfinity     | null                         |
| -INF      | Double.NegativeInfinity     | null                         |
| -Infinity | Double.NegativeInfinity     | null                         |

However, when these values are quoted, we get the following unexpected behavior, because a different code path is used that is inconsistent with Jackson's parsing and ignores the `allowNonNumericNumbers` parser option.

| Value       | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
| ----------- | --------------------------- | ---------------------------- |
| "NaN"       | Double.NaN                  | Double.NaN                   |
| "+INF"      | null                        | null                         |
| "+Infinity" | null                        | null                         |
| "Infinity"  | Double.PositiveInfinity     | Double.PositiveInfinity      |
| "-INF"      | null                        | null                         |
| "-Infinity" | Double.NegativeInfinity     | Double.NegativeInfinity      |

This PR updates the code path that handles quoted non-numeric numbers to make it consistent with the path that handles unquoted values.

### Why are the changes needed?
The current behavior does not match the documented behavior in https://spark.apache.org/docs/latest/sql-data-sources-json.html

### Does this PR introduce _any_ user-facing change?

Yes, parsing of quoted `NaN` and `Infinity` values will now be consistent with the unquoted versions.

### How was this patch tested?

Unit tests are updated.

Closes #35573 from andygrove/SPARK-38060.

Authored-by: Andy Grove <andygrov...@gmail.com>
Signed-off-by: Sean Owen <sro...@gmail.com>
---
 docs/core-migration-guide.md                       |  2 ++
 .../spark/sql/catalyst/json/JacksonParser.scala    | 18 ++++++----
 .../datasources/json/JsonParsingOptionsSuite.scala | 39 ++++++++++++++++++++++
 .../sql/execution/datasources/json/JsonSuite.scala |  6 ++++
 4 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md
index 745b80d..588433c 100644
--- a/docs/core-migration-guide.md
+++ b/docs/core-migration-guide.md
@@ -26,6 +26,8 @@ license: |
 
 - Since Spark 3.3, Spark migrates its log4j dependency from 1.x to 2.x because log4j 1.x has reached end of life and is no longer supported by the community. Vulnerabilities reported after August 2015 against log4j 1.x were not checked and will not be fixed. Users should rewrite original log4j properties files using log4j2 syntax (XML, JSON, YAML, or properties format). Spark rewrites the `conf/log4j.properties.template` which is included in Spark distribution, to `conf/log4j2.properties [...]
 
+- Since Spark 3.3, when reading values from a JSON attribute defined as `FloatType` or `DoubleType`, the strings `"+Infinity"`, `"+INF"`, and `"-INF"` are now parsed to the appropriate values, in addition to the already supported `"Infinity"` and `"-Infinity"` variations. This change was made to improve consistency with Jackson's parsing of the unquoted versions of these values. Also, the `allowNonNumericNumbers` option is now respected so these strings will now be considered invalid if [...]
+
 ## Upgrading from Core 3.1 to 3.2
 
 - Since Spark 3.2, `spark.scheduler.allocation.file` supports read remote file using hadoop filesystem which means if the path has no scheme Spark will respect hadoop configuration to read it. To restore the behavior before Spark 3.2, you can specify the local scheme for `spark.scheduler.allocation.file` e.g. `file:///path/to/file`.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index a1f9487..abcbdb8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -204,9 +204,12 @@ class JacksonParser(
         case VALUE_STRING if parser.getTextLength >= 1 =>
           // Special case handling for NaN and Infinity.
           parser.getText match {
-            case "NaN" => Float.NaN
-            case "Infinity" => Float.PositiveInfinity
-            case "-Infinity" => Float.NegativeInfinity
+            case "NaN" if options.allowNonNumericNumbers =>
+              Float.NaN
+            case "+INF" | "+Infinity" | "Infinity" if options.allowNonNumericNumbers =>
+              Float.PositiveInfinity
+            case "-INF" | "-Infinity" if options.allowNonNumericNumbers =>
+              Float.NegativeInfinity
             case _ => throw QueryExecutionErrors.cannotParseStringAsDataTypeError(
               parser, VALUE_STRING, FloatType)
           }
@@ -220,9 +223,12 @@ class JacksonParser(
         case VALUE_STRING if parser.getTextLength >= 1 =>
           // Special case handling for NaN and Infinity.
           parser.getText match {
-            case "NaN" => Double.NaN
-            case "Infinity" => Double.PositiveInfinity
-            case "-Infinity" => Double.NegativeInfinity
+            case "NaN" if options.allowNonNumericNumbers =>
+              Double.NaN
+            case "+INF" | "+Infinity" | "Infinity" if options.allowNonNumericNumbers =>
+              Double.PositiveInfinity
+            case "-INF" | "-Infinity" if options.allowNonNumericNumbers =>
+              Double.NegativeInfinity
             case _ => throw QueryExecutionErrors.cannotParseStringAsDataTypeError(
               parser, VALUE_STRING, DoubleType)
           }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
index e9fe79a..703085d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
@@ -130,6 +130,45 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSparkSession {
         Double.NegativeInfinity, Double.NegativeInfinity))
   }
 
+  test("allowNonNumericNumbers on - quoted") {
+    val str =
+      """{"c0":"NaN", "c1":"+INF", "c2":"+Infinity", "c3":"Infinity", "c4":"-INF",
+        |"c5":"-Infinity"}""".stripMargin
+    val df = spark.read
+      .schema(new StructType()
+        .add("c0", "double")
+        .add("c1", "double")
+        .add("c2", "double")
+        .add("c3", "double")
+        .add("c4", "double")
+        .add("c5", "double"))
+      .option("allowNonNumericNumbers", true).json(Seq(str).toDS())
+    checkAnswer(
+      df,
+      Row(
+        Double.NaN,
+        Double.PositiveInfinity, Double.PositiveInfinity, Double.PositiveInfinity,
+        Double.NegativeInfinity, Double.NegativeInfinity))
+  }
+
+  test("allowNonNumericNumbers off - quoted") {
+    val str =
+      """{"c0":"NaN", "c1":"+INF", "c2":"+Infinity", "c3":"Infinity", "c4":"-INF",
+        |"c5":"-Infinity"}""".stripMargin
+    val df = spark.read
+      .schema(new StructType()
+        .add("c0", "double")
+        .add("c1", "double")
+        .add("c2", "double")
+        .add("c3", "double")
+        .add("c4", "double")
+        .add("c5", "double"))
+      .option("allowNonNumericNumbers", false).json(Seq(str).toDS())
+    checkAnswer(
+      df,
+      Row(null, null, null, null, null, null))
+  }
+
   test("allowBackslashEscapingAnyCharacter off") {
     val str = """{"name": "Cazen Lee", "price": "\$10"}"""
     val df = spark.read.option("allowBackslashEscapingAnyCharacter", "false").json(Seq(str).toDS())
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
index 3daad30..bd01975 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@@ -2021,13 +2021,19 @@ abstract class JsonSuite
   test("SPARK-18772: Parse special floats correctly") {
     val jsons = Seq(
       """{"a": "NaN"}""",
+      """{"a": "+INF"}""",
+      """{"a": "-INF"}""",
       """{"a": "Infinity"}""",
+      """{"a": "+Infinity"}""",
       """{"a": "-Infinity"}""")
 
     // positive cases
     val checks: Seq[Double => Boolean] = Seq(
       _.isNaN,
       _.isPosInfinity,
+      _.isNegInfinity,
+      _.isPosInfinity,
+      _.isPosInfinity,
       _.isNegInfinity)
 
     Seq(FloatType, DoubleType).foreach { dt =>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
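[Editor's note] The guarded match arms this commit adds to JacksonParser.scala can be mirrored in a small standalone sketch, useful for checking the intended behavior in both tables of the PR description without a Spark build. This is plain Scala with no Spark dependency; `parseSpecialDouble` is a hypothetical helper name, not a Spark API.

```scala
// Standalone sketch of the quoted-literal handling added by this patch.
// Returns None where JacksonParser throws cannotParseStringAsDataTypeError,
// which the default PERMISSIVE mode then surfaces as a null column value.
def parseSpecialDouble(text: String, allowNonNumericNumbers: Boolean): Option[Double] =
  text match {
    case "NaN" if allowNonNumericNumbers =>
      Some(Double.NaN)
    case "+INF" | "+Infinity" | "Infinity" if allowNonNumericNumbers =>
      Some(Double.PositiveInfinity)
    case "-INF" | "-Infinity" if allowNonNumericNumbers =>
      Some(Double.NegativeInfinity)
    case _ =>
      // Any other string (or any of the above with the option disabled)
      // is treated as an invalid floating-point value.
      None
  }
```

Explicit arms for every spelling are needed here because `java.lang.Double.parseDouble` accepts the `Infinity`-style spellings and `NaN` but rejects `+INF`/`-INF`, while Jackson accepts all of them for unquoted values when `ALLOW_NON_NUMERIC_NUMBERS` is enabled.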