HyukjinKwon commented on a change in pull request #23665: [SPARK-26745][SQL] 
Skip empty lines in JSON-derived DataFrames when skipParsing optimization in 
effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251658839
 
 

 ##########
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ##########
 @@ -55,11 +56,15 @@ class FailureSafeParser[IN](
 
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-     if (skipParsing) {
-       Iterator.single(InternalRow.empty)
-     } else {
 -       rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-     }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
 
 Review comment:
   re: https://github.com/apache/spark/pull/23665#discussion_r251508679
   
   This is about the reasoning behind the count (continued below).
   
   > I think for permissive mode, the results(at least the counts) are always 
same even if some input are malformed? 
   
   To be 100% sure about correct results, we should always parse everything. 
We currently skip parsing as an optimization, and that has started to produce 
some inconsistent results.
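   
   To make the inconsistency concrete, here is a minimal sketch in plain Scala with no Spark dependency. `EmptyRow` stands in for `InternalRow.empty`, and the body of `unparsedRecordIsNonEmpty` is an assumption (the real helper is not shown in this hunk), as is the guarded branch yielding nothing for blank input.

   ```scala
   object SkipParsingSketch {
     // Stand-in for InternalRow.empty; the real type lives in Spark's catalyst module.
     case object EmptyRow

     // Hypothetical emptiness check: any non-whitespace character counts as content.
     def unparsedRecordIsNonEmpty(input: String): Boolean =
       input.exists(c => !c.isWhitespace)

     // Behaviour on the removed lines: one row per input record, blank or not.
     def parseUnguarded(input: String): Iterator[EmptyRow.type] =
       Iterator.single(EmptyRow)

     // Assumed behaviour of the added lines: blank records produce no row.
     def parseGuarded(input: String): Iterator[EmptyRow.type] =
       if (unparsedRecordIsNonEmpty(input)) Iterator.single(EmptyRow) else Iterator.empty

     def main(args: Array[String]): Unit = {
       val lines = Seq("""{"a": 1}""", "", "   ", """{"a": 2}""")
       println(lines.iterator.flatMap(parseUnguarded).size) // prints 4
       println(lines.iterator.flatMap(parseGuarded).size)   // prints 2
     }
   }
   ```

   With two blank lines among four, the unguarded path counts 4 rows while the guarded path counts 2, so the count depends on whether the optimization kicks in.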
   
   Yes, so we don't convert 100% of the input. In that case, we should at least 
parse `StructType()`, which I guess corresponds to an empty object `{...}`. I 
think that's what JSON did before the PRs pointed to above.
   
   > Otherwise, it seems like users only want to count the number of lines, and 
they should read the json files as text and do count.
   
   Yes, I agree. It shouldn't behave like text source + count(). Let's revert 
anyway. I don't think this behaviour is ideal.
   
   For other behaviours, I was thinking about adding a `README.md` that 
whitelists behaviours for both CSV and JSON for Spark developers, somewhere 
under the JSON- and CSV-related directories. It's a bit of grunt work, but it 
sounds like it should be done. I could do this together with @MaxGekk too.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
