Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20894#discussion_r189103522
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -497,6 +498,11 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
      StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
    val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+      if (!parsedOptions.enforceSchema) {
+        CSVDataSource.checkHeader(firstLine, new CsvParser(parsedOptions.asParserSettings),
--- End diff ---
Thank you for explaining the use case. I didn't realize that the input
dataset could contain multiple headers. So the headers could potentially
contain column names in a different order, or even incorrect column names.
That case seems pretty tough to fix.
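To make the concern concrete, a header check of the general kind discussed here compares the parsed header row against the expected schema field names and flags mismatches. The sketch below is illustrative only, not Spark's actual `CSVDataSource.checkHeader` implementation; the object and method names are hypothetical:

```scala
// Hypothetical sketch of a header-vs-schema check (not Spark's actual code).
// Given a parsed header row and the schema's field names, report the first
// mismatch, if any. A file with multiple headers would only have its first
// header validated this way, which is the limitation discussed above.
object HeaderCheckSketch {
  def checkHeaderAgainstSchema(header: Seq[String], fieldNames: Seq[String]): Option[String] = {
    if (header.length != fieldNames.length) {
      // Column-count mismatch is reported before any per-column comparison.
      Some(s"CSV header has ${header.length} columns, but the schema has ${fieldNames.length}")
    } else {
      // Compare position by position; out-of-order or misnamed columns are caught here.
      header.zip(fieldNames).zipWithIndex.collectFirst {
        case ((h, f), i) if h != f =>
          s"Column $i: header name '$h' does not match schema field '$f'"
      }
    }
  }
}
```

Note that a check like this catches a reordered or misnamed first header, but headers buried later in the dataset (the multi-header case) would still pass through as ordinary data rows.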
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]