HyukjinKwon commented on a change in pull request #23665: [SPARK-26745][SQL] 
Skip empty lines in JSON-derived DataFrames when skipParsing optimization in 
effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251276720
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ##########
 @@ -55,11 +56,15 @@ class FailureSafeParser[IN](
 
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-     if (skipParsing) {
-       Iterator.single(InternalRow.empty)
-     } else {
-     rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-     }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
 
 Review comment:
   I think we should revert https://github.com/apache/spark/pull/21909 instead. 
That PR was a band-aid fix, and this change is another band-aid on top of it.
   
   `JacksonParser` itself can produce no records or multiple records. The previous 
code path assumed that it always produces a single record, and the current fix 
checks the input again outside of `JacksonParser`.
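   
   To make the cardinality point concrete, here is a minimal sketch that does not use Spark at all. `parseRecords` is a hypothetical stand-in for `JacksonParser` (it only mimics how many records come out of one input line), and `countWithShortcut` models a count path that assumes one row per input, as the `Iterator.single(InternalRow.empty)` branch in the diff does. All names in it are assumptions made for illustration:
   
   ```scala
   // Simplified model of the problem; none of this is Spark code.
   object RecordCountSketch {
     // Hypothetical parser: blank input -> 0 records, a top-level array ->
     // one record per element, anything else -> exactly 1 record.
     def parseRecords(line: String): Seq[String] = {
       val trimmed = line.trim
       if (trimmed.isEmpty) {
         Seq.empty
       } else if (trimmed.startsWith("[") && trimmed.endsWith("]")) {
         val inner = trimmed.substring(1, trimmed.length - 1).trim
         // Naive split between "}" and "{"; only good enough for flat objects.
         if (inner.isEmpty) Seq.empty
         else inner.split("(?<=\\})\\s*,\\s*(?=\\{)").toSeq
       } else {
         Seq(trimmed)
       }
     }
   
     // Shortcut that emits exactly one placeholder row per input line.
     def countWithShortcut(lines: Seq[String]): Long =
       lines.iterator.flatMap(_ => Iterator.single(())).size.toLong
   
     // Count derived from what the parser actually emits.
     def countWithParser(lines: Seq[String]): Long =
       lines.iterator.flatMap(parseRecords).size.toLong
   }
   ```
   
   In this model, a blank line is counted once by `countWithShortcut` but zero times by `countWithParser`, and a line holding a two-element array is counted once by the shortcut but twice by the parser. Checking for emptiness outside the parser only patches the first of those two cases.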
   
   There is another problem introduced by https://github.com/apache/spark/pull/21909: 
it also appears to produce incorrect counts when the input JSON is an array:
   
   ```bash
   $ cat tmp.json
   [{"a": 1}, {"a": 2}]
   ```
   
   Current master:
   
   ```scala
   scala> spark.read.json("tmp.json").show()
   +---+
   |  a|
   +---+
   |  1|
   |  2|
   +---+
   
   
   scala> spark.read.json("tmp.json").count()
   res1: Long = 1
   ```
   
   Spark 2.3.1:
   
   ```scala
   scala> spark.read.json("tmp.json").show()
   +---+
   |  a|
   +---+
   |  1|
   |  2|
   +---+
   
   
   scala> spark.read.json("tmp.json").count()
   res1: Long = 2
   ```
   
