cloud-fan commented on a change in pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251502380
 
 

 ##########
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ##########
 @@ -55,11 +56,15 @@ class FailureSafeParser[IN](
 
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-     if (skipParsing) {
-       Iterator.single(InternalRow.empty)
-     } else {
 -       rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-     }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
 
 Review comment:
   ```
   [{...}, {...}] => 2
   [] => 0
   {...} => 1
   # empty string => 0
   ```
   I think the key here is that one line can produce 0, 1, or more records, so how do we speed it up when we only care about counts? It looks to me that we can enable the count optimization only for `{...}`, and fall back to parsing for the other cases. @MaxGekk do you think this is applicable? If it's a simple fix, let's do it for branch 2.4 as well; otherwise +1 for reverting it from 2.4 and redoing it on master.
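   To make the suggestion concrete, here is a minimal sketch (not Spark's actual code; `canUseCountFastPath` and `recordCount` are hypothetical names) of what "count optimization only for `{...}`, fall back otherwise" could look like: a cheap structural check on the raw line decides whether we may skip parsing and count the line as exactly one record, or must run the real parser to learn how many records it yields.

   ```scala
   object CountFastPath {
     // Hypothetical helper: the fast path is only safe for a single
     // top-level object `{...}`, which yields exactly one record.
     // Arrays (`[...]`) may yield 0 or N records, and empty/blank
     // lines yield 0, so those must fall back to real parsing.
     def canUseCountFastPath(line: String): Boolean = {
       val t = line.trim
       t.nonEmpty && t.startsWith("{")
     }

     // Illustrative count under the proposed rule: return 1 without
     // parsing when the fast path applies, otherwise delegate to a
     // caller-supplied parser and count its output records.
     def recordCount(line: String, parse: String => Seq[Any]): Int =
       if (canUseCountFastPath(line)) 1
       else parse(line).size
   }
   ```

   Note this is deliberately conservative: any line that is not clearly a single object pays the full parsing cost, which preserves the `[{...}, {...}] => 2`, `[] => 0`, and empty-string `=> 0` semantics above while still skipping work for the common one-object-per-line case.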

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
