sumitsu commented on a change in pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251273243
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
##########
@@ -55,11 +56,15 @@ class FailureSafeParser[IN](
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      if (skipParsing) {
-        Iterator.single(InternalRow.empty)
-      } else {
-        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-      }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
+          Iterator.single(InternalRow.empty)
+        } else {
+          Iterator.empty
+        }
+      } else {
+        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      }
Review comment:
One of my earlier revisions [worked in the way you suggest](https://github.com/sumitsu/spark/blob/13942b8da17d7e58e690b7a8b27c1a3fe20be135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L132-L160) (if I've understood your point correctly); I changed it in order to avoid redundant empty-line filtering in the case where the full parser has to run anyway (i.e. where `skipParsing == false`).

What do you think? Is that a valid optimization, or is it better to do the filtering on the JSON side, as in 13942b8, to avoid changes to `FailureSafeParser`?
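
For reference, here is a minimal sketch of what `unparsedRecordIsNonEmpty` might look like inside `FailureSafeParser[IN]`. The helper name comes from the hunk above, but its body is not shown there, so the version below is an illustrative assumption that matches on the record types the JSON source typically passes in:

```scala
// Illustrative sketch only; the real helper may differ. FailureSafeParser[IN]
// is generic, so this matches on the concrete record types that commonly
// reach it from the JSON data source.
private def unparsedRecordIsNonEmpty(record: IN): Boolean = record match {
  case text: org.apache.hadoop.io.Text => text.getLength > 0
  case str: String => str.nonEmpty
  case utf8: org.apache.spark.unsafe.types.UTF8String => utf8.numBytes() > 0
  case other => other != null // conservative default: treat unknown types as non-empty
}
```

Returning `Iterator.empty` for records that fail this check is what actually drops blank lines from the result, since callers flatten the per-record iterators. The 13942b8 alternative would instead keep `FailureSafeParser` unchanged and drop blank lines in `JsonDataSource`, along the lines of `linesReader.filterNot(_.toString.trim.isEmpty)` (again an assumption about the shape of that revision).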