HyukjinKwon commented on a change in pull request #23665: [SPARK-26745][SQL]
Skip empty lines in JSON-derived DataFrames when skipParsing optimization in
effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251658839
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
##########
@@ -55,11 +56,15 @@ class FailureSafeParser[IN](
def parse(input: IN): Iterator[InternalRow] = {
try {
- if (skipParsing) {
- Iterator.single(InternalRow.empty)
- } else {
- rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
- }
+ if (skipParsing) {
+ if (unparsedRecordIsNonEmpty(input)) {
Review comment:
re: https://github.com/apache/spark/pull/23665#discussion_r251508679
This is reasoning about the count (continued below).
> I think for permissive mode, the results(at least the counts) are always
> same even if some input are malformed?
To be 100% sure about correct results, we should always parse everything.
We currently skip parsing as an optimization, and that has started to produce
some inconsistent results.
Yes, so we don't convert 100% of the input. In that case, we should at least
parse against `StructType()`, which I guess corresponds to an empty object
`{...}`. I think that's what the JSON datasource did before the PRs pointed to
above.
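To make the count argument concrete, here is a minimal Spark-free sketch. It is
only a model of the discussion, not Spark code: `parseMinimal`, the `Outcome`
type and the `count*` helpers are hypothetical names, and the "parser" is a
crude stand-in that treats blank input as no record and anything not shaped
like `{...}` as malformed.

```scala
object SkipParsingSketch {
  // Hypothetical outcome of a minimal parse; not a Spark type.
  sealed trait Outcome
  case object NoRecord extends Outcome   // blank input: no row at all
  case object Parsed extends Outcome     // looks like a JSON object
  case object Malformed extends Outcome  // anything else

  // Crude stand-in for parsing against an empty schema (assumption:
  // a record is "well-formed" if it is wrapped in braces).
  def parseMinimal(record: String): Outcome = {
    val t = record.trim
    if (t.isEmpty) NoRecord
    else if (t.startsWith("{") && t.endsWith("}")) Parsed
    else Malformed
  }

  // Skip parsing entirely: one empty row per input record.
  def countSkipAll(records: Seq[String]): Int = records.size

  // Skip parsing but ignore blank records (loosely models the check in this diff).
  def countSkipNonEmpty(records: Seq[String]): Int =
    records.count(_.trim.nonEmpty)

  // Always parse; a permissive mode still yields a row for malformed input.
  def countParsedPermissive(records: Seq[String]): Int =
    records.map(parseMinimal).count(_ != NoRecord)

  // Always parse; a mode that drops malformed input yields fewer rows.
  def countParsedDropMalformed(records: Seq[String]): Int =
    records.map(parseMinimal).count(_ == Parsed)

  // Two well-formed records, two blank ones, one malformed.
  val sample: Seq[String] = Seq("""{"a": 1}""", "", "   ", "{broken", "{}")
}
```

Under these assumptions, skipping blank records matches the permissive count,
but only a real (even minimal) parse can tell malformed records apart, which is
why the counts can diverge in other modes.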