HyukjinKwon commented on a change in pull request #23665: [SPARK-26745][SQL]
Skip empty lines in JSON-derived DataFrames when skipParsing optimization in
effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251373443
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
##########
@@ -55,11 +56,15 @@ class FailureSafeParser[IN](
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      if (skipParsing) {
-        Iterator.single(InternalRow.empty)
-      } else {
-        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-      }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
Review comment:
I agree that our JSON / CSV sources have some holes around parsing, counting,
column pruning (in particular when the input is malformed), etc. I was thinking
about whitelisting the problems and behaviours in 3.0.0.
I don't think reviewers noticed this behaviour when #21909 was merged, even
though it is pretty critical to discuss, as you said above. Let's do this
again, probably with the behaviours clearly whitelisted. I think we should
avoid going back and forth on this with the doc.
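
To make the counting discrepancy concrete, here is a minimal, Spark-free
sketch. It is not the actual FailureSafeParser: `Row`, `rawParse`,
`parseBefore`, `parseAfter`, and `count` are hypothetical stand-ins, and the
empty-iterator branch in `parseAfter` is an assumption, since the hunk above is
cut off before the `else`.

```scala
object SkipParsingSketch {
  // Hypothetical stand-in for InternalRow.
  final case class Row(values: Seq[String])

  // Stand-in for the raw JSON parser: a blank line produces no rows.
  def rawParse(line: String): Seq[Row] =
    if (line.trim.isEmpty) Seq.empty else Seq(Row(Seq(line)))

  // Removed branch of the diff: with skipParsing, every input yields one empty row.
  def parseBefore(line: String, skipParsing: Boolean): Iterator[Row] =
    if (skipParsing) Iterator.single(Row(Seq.empty)) else rawParse(line).iterator

  // Added branch of the diff, with an assumed "else": blank input yields nothing.
  def parseAfter(line: String, skipParsing: Boolean): Iterator[Row] =
    if (skipParsing) {
      if (line.trim.nonEmpty) Iterator.single(Row(Seq.empty)) else Iterator.empty
    } else {
      rawParse(line).iterator
    }

  // Counts the rows a given parse function produces over all input lines.
  def count(lines: Seq[String], parse: String => Iterator[Row]): Int =
    lines.iterator.flatMap(parse).size

  // Two records separated by an empty line.
  val sample: Seq[String] = Seq("""{"a": 1}""", "", """{"a": 2}""")

  def main(args: Array[String]): Unit = {
    println(count(sample, parseBefore(_, skipParsing = true)))  // counts the blank line
    println(count(sample, parseBefore(_, skipParsing = false))) // full parse skips it
    println(count(sample, parseAfter(_, skipParsing = true)))   // matches the full parse
  }
}
```

In this model the old skipParsing path counts three rows for the sample while
the full parse counts two, which is the kind of behaviour difference that
would need to be listed explicitly.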