[ https://issues.apache.org/jira/browse/SPARK-42118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680419#comment-17680419 ]
Hyukjin Kwon commented on SPARK-42118:
--------------------------------------

As a workaround, you can do:

{code}
val df = spark.read.format("json").option("multiLine", true).load("/tmp/json")
val newDF = spark.createDataFrame(df.rdd, df.schema)
{code}

then

{code}
newDF.show(false)
newDF.count
{code}

will show consistent output.

> Wrong result when parsing a multiline JSON file with differing types for same column
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-42118
>                 URL: https://issues.apache.org/jira/browse/SPARK-42118
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Dilip Biswal
>            Priority: Major
>
> Here is a simple reproduction of the problem. We have a JSON file in multiLine
> format whose content looks like the following.
> {code}
> [{"name":""},{"name":123.34}]
> {code}
> Here is the result of the Spark query when we read the above content.
> scala> val df = spark.read.format("json").option("multiLine", true).load("/tmp/json")
> df: org.apache.spark.sql.DataFrame = [name: double]
> scala> df.show(false)
> +----+
> |name|
> +----+
> |null|
> +----+
> scala> df.count
> res5: Long = 2
> This is quite a serious problem for us, as it is causing us to master corrupt
> data in the lake. If there is an issue parsing the input, we expect Spark to
> set "_corrupt_record" so that we can act on it. Please note that df.count
> reports 2 rows, whereas df.show displays only 1 row with a null value.
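
For anyone who wants to try this outside the shell, here is a minimal, self-contained sketch of the reported reproduction followed by the workaround from the comment (rebuilding the DataFrame from df.rdd and df.schema). It assumes a local Spark session with spark-sql on the classpath. The object name, temp directory, and file name are hypothetical and not part of the report. The output depends on the Spark version, so none is shown here.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object Spark42118Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-42118-repro")
      .getOrCreate()

    // Write the JSON array from the report into a temp directory
    // (the report used /tmp/json; the temp path is an assumption).
    val dir = Files.createTempDirectory("spark-42118")
    Files.write(
      dir.resolve("data.json"),
      """[{"name":""},{"name":123.34}]""".getBytes(StandardCharsets.UTF_8))

    // Read it the same way as in the report.
    val df = spark.read.format("json").option("multiLine", true).load(dir.toString)
    df.printSchema()
    df.show(false)
    println(s"df.count = ${df.count}")

    // Workaround from the comment: rebuild the DataFrame from the
    // underlying RDD and schema, then compare show and count again.
    val newDF = spark.createDataFrame(df.rdd, df.schema)
    newDF.show(false)
    println(s"newDF.count = ${newDF.count}")

    spark.stop()
  }
}
```

Comparing the number of rows printed by show with the value printed for count, before and after the rebuild, is the check described in the report.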