HyukjinKwon commented on a change in pull request #23665: [SPARK-26745][SQL]
Skip empty lines in JSON-derived DataFrames when skipParsing optimization in
effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251658839
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
##########
@@ -55,11 +56,15 @@ class FailureSafeParser[IN](
def parse(input: IN): Iterator[InternalRow] = {
try {
- if (skipParsing) {
- Iterator.single(InternalRow.empty)
- } else {
- rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
- }
+ if (skipParsing) {
+ if (unparsedRecordIsNonEmpty(input)) {
Review comment:
re: https://github.com/apache/spark/pull/23665#discussion_r251508679
This is the reasoning about the count (continued below).
> I think for permissive mode, the results(at least the counts) are always
same even if some input are malformed?
To be 100% sure of correct results, we should always parse everything. We
currently skip parsing as an optimization, and that has started to produce
some inconsistent results.
Yes, so we don't convert 100%. In that case, we should at least parse
`StructType()`, which I guess corresponds to an empty object `{...}`. I think
that's what JSON did before the PRs pointed out above.
> Otherwise, it seems like users only want to count the number of lines, and
they should read the json files as text and do count.
Yes, I agree. It shouldn't behave like text source + count(). Let's revert
anyway. I don't think this behaviour is ideal.
For other behaviours, I was thinking about adding a `README.md` that
whitelists behaviours for both CSV and JSON for Spark developers, somewhere
under the JSON- and CSV-related directories. It's a bit of grunt work, but it
sounds like it should be done. I could do this together with @MaxGekk since he
has worked on this area a lot as well.
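To make the count discrepancy concrete, here is a minimal sketch in plain Scala with no Spark dependency. It only models the two fast-path behaviours discussed in this thread. `CountSketch`, `countSkipParsing`, `countSkipParsingNonEmpty` and `isNonEmpty` are hypothetical names, not Spark APIs. Treating whitespace-only lines as empty is an assumption, since the body of `unparsedRecordIsNonEmpty` is not shown in this hunk.

```scala
// Illustrative model only; none of these names are Spark APIs.
object CountSketch {
  // Sample input: two JSON records, one blank line, one whitespace-only line.
  val lines: Seq[String] = Seq("""{"a": 1}""", "", "   ", """{"a": 2}""")

  // Fast path without any check: one empty row per input line,
  // which is effectively what text source + count() would return.
  def countSkipParsing(in: Seq[String]): Int =
    in.map(_ => 1).sum

  // Hypothetical stand-in for `unparsedRecordIsNonEmpty`.
  // Assumption: a whitespace-only record counts as empty.
  def isNonEmpty(record: String): Boolean = record.trim.nonEmpty

  // Fast path with the check from the diff: blank records yield no row.
  def countSkipParsingNonEmpty(in: Seq[String]): Int =
    in.count(isNonEmpty)

  def main(args: Array[String]): Unit = {
    println(s"count without check: ${countSkipParsing(lines)}")
    println(s"count with non-empty check: ${countSkipParsingNonEmpty(lines)}")
  }
}
```

On this sample the two counts are 4 and 2. Neither path detects malformed records, so only parsing everything would keep the count consistent with the parsed results.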