[ https://issues.apache.org/jira/browse/SPARK-34441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285618#comment-17285618 ]
Jean-Francis Roy edited comment on SPARK-34441 at 2/17/21, 3:17 AM:
--------------------------------------------------------------------

[~hyukjin.kwon] of course, here is an example:

{code:java}
scala> case class Foo(a: String)

scala> val ds = List("", "{", "{}", """{"a"}""", """{"a": "bar"}""", """{"a": 42}""").toDS

scala> import org.apache.spark.sql.types._

scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))))).show()
+------------+---------+
|       value|converted|
+------------+---------+
|            |     null|
|           {|       []|
|          {}|       []|
|       {"a"}|       []|
|{"a": "bar"}|    [bar]|
|   {"a": 42}|     [42]|
+------------+---------+
{code}

We see above that faulty JSON will often result in a structure with `null` fields instead of a `null` directly, which is a big change of behavior between Spark 2 and Spark 3. The documentation still states that the behavior is Spark 2's.

Moreover, I cannot reproduce Spark 2's behavior. I do want faulty input to be converted to null. I can make the code throw using the `FAILFAST` mode:

{code:java}
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))), Map("mode" -> "FAILFAST"))).show()
{code}

But I cannot use the `DROPMALFORMED` mode as it is not supported:

{code:java}
scala> ds.withColumn("converted", from_json($"value", StructType(Array(StructField("a", StringType))), Map("mode" -> "DROPMALFORMED"))).show()
java.lang.IllegalArgumentException: from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.
{code}
> from_json documentation is wrong about malformed JSONs output
> -------------------------------------------------------------
>
>                 Key: SPARK-34441
>                 URL: https://issues.apache.org/jira/browse/SPARK-34441
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Jean-Francis Roy
>            Priority: Minor
>
> The documentation of the `from_json` function states that malformed json will
> return a `null` value, which is not the case anymore after
> https://issues.apache.org/jira/browse/SPARK-25243.