HyukjinKwon commented on a change in pull request #23665: [SPARK-26745][SQL]
Skip empty lines in JSON-derived DataFrames when skipParsing optimization in
effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251482039
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
##########
@@ -55,11 +56,15 @@ class FailureSafeParser[IN](
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      if (skipParsing) {
-        Iterator.single(InternalRow.empty)
-      } else {
-        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-      }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
Review comment:
How are we going to explain this to users? Do we say that when you count,
records are separated by newlines, while other operations recognise JSON arrays
and empty strings, so count is a special case? CSV doesn't do that, FWIW: it
filters empty strings. I don't think we should require this kind of
special-case reasoning for such basic operations.
If I were a user, I would find it counterintuitive. Am I the only one who
thinks this way? This case specifically doesn't look straightforward. I know
there are parsing issues where we should define the behaviour, but this case
doesn't look like one of those.
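
To make the concern concrete, here is a minimal sketch in plain Scala. It is a simplified model, not Spark code: parsedRows, skippedRows, skippedRowsWithEmptyCheck and the sample input are all hypothetical names made up for illustration, and the array splitting is deliberately naive. It only assumes the behaviour described above: the full parser drops empty lines and expands a JSON array into several rows, while the skipParsing shortcut emits one row per line.

```scala
// Simplified model of the discussion; none of these names are Spark APIs.
object CountMismatchSketch {
  // Stand-in for the full parser: an empty line yields no rows, and a line
  // holding a JSON array yields one row per element (naive split after "}").
  def parsedRows(line: String): Seq[String] = {
    val t = line.trim
    if (t.isEmpty) Seq.empty
    else if (t.startsWith("[") && t.endsWith("]"))
      t.substring(1, t.length - 1).split("(?<=\\}),").map(_.trim).filter(_.nonEmpty).toSeq
    else Seq(t)
  }

  // Stand-in for the skipParsing shortcut: exactly one empty row per input line.
  def skippedRows(line: String): Seq[String] = Seq("")

  // Stand-in for the shortcut with an emptiness check added, as in the diff above.
  def skippedRowsWithEmptyCheck(line: String): Seq[String] =
    if (line.trim.nonEmpty) Seq("") else Seq.empty

  // Hypothetical input: one object, two empty lines, one array of two objects.
  val input: Seq[String] = Seq("""{"a": 1}""", "", "", """[{"a": 2}, {"a": 3}]""")

  val countWithParsing: Int = input.flatMap(parsedRows).size
  val countWithSkipping: Int = input.flatMap(skippedRows).size
  val countWithEmptyCheck: Int = input.flatMap(skippedRowsWithEmptyCheck).size

  def main(args: Array[String]): Unit =
    println(s"parsed=$countWithParsing skipped=$countWithSkipping emptyCheck=$countWithEmptyCheck")
}
```

In this model the three counts are 3, 4 and 2 for the same input. Filtering empty lines in the shortcut path removes one mismatch, but the array line still makes count disagree with the operations that actually parse.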