Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21909#discussion_r206059735
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -450,7 +450,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
         input => rawParser.parse(input, createParser, UTF8String.fromString),
         parsedOptions.parseMode,
         schema,
-        parsedOptions.columnNameOfCorruptRecord)
+        parsedOptions.columnNameOfCorruptRecord,
+        optimizeEmptySchema = true)
--- End diff --
> what would be the case to turn it off?
We can apply the optimization when we know in advance that one JSON object
corresponds to one struct. In that case, we can return an empty row when the
required schema (struct) is empty. If `multiLine` is enabled, a single JSON
document can contain many structs, so we cannot tell in advance how many empty
rows to return without parsing.
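
To illustrate the point, here is a minimal toy sketch. It is not the actual
parser code: `Doc`, `Row`, `fullParse` and `EmptySchemaSketch` are hypothetical
names, and a document is modelled only by the number of top-level JSON objects
it contains.

```scala
// Toy model (assumption): a document is described only by how many
// top-level JSON objects it holds; real parsing is replaced by a counter.
final case class Doc(objects: Int)
final case class Row(values: Seq[Any])

object EmptySchemaSketch {
  // Stand-in for the real, expensive parser: one row per JSON object.
  def fullParse(doc: Doc): Seq[Row] =
    Seq.fill(doc.objects)(Row(Seq.empty))

  // With an empty required schema and the flag on, skip parsing and
  // emit exactly one empty row; otherwise parse the whole document.
  def parse(doc: Doc, schemaIsEmpty: Boolean, optimizeEmptySchema: Boolean): Seq[Row] =
    if (schemaIsEmpty && optimizeEmptySchema) Seq(Row(Seq.empty))
    else fullParse(doc)
}
```

With one object per input (the line-delimited case), the shortcut and the full
parse both yield one row. For a document holding three objects, the shortcut
would still yield one row while the full parse yields three, which is why the
flag would have to be off when `multiLine` is enabled.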