GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20894#discussion_r189062745
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -497,6 +498,11 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
     val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+      if (!parsedOptions.enforceSchema) {
+        CSVDataSource.checkHeader(firstLine, new CsvParser(parsedOptions.asParserSettings),
--- End diff ---
I mean we could, for example, create a dataset from spark.read.text("tmp/*.csv"), preprocess it, and then convert it via spark.read.csv(dataset). In that case, every file would contain its own header line, but this check doesn't validate each file's header.
Shall we document this if it's hard to fix?
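A minimal sketch of the scenario described above, assuming a running SparkSession named `spark` and several CSV files under tmp/*.csv, each starting with its own header row (paths and the filter step are illustrative, not from the PR):

```scala
// All files' lines are merged into a single Dataset[String],
// including every file's header row.
val lines: Dataset[String] = spark.read.textFile("tmp/*.csv")

// Some hypothetical preprocessing step.
val preprocessed = lines.filter(_.nonEmpty)

// When the merged dataset is parsed as CSV, only the first line can be
// compared against the schema; the header rows of the remaining files
// are treated as ordinary data rows and are never validated.
val df = spark.read
  .option("header", "true")
  .option("enforceSchema", "false")
  .csv(preprocessed)
```

Because spark.read.csv(csvDataset: Dataset[String]) sees a single stream of lines with no file boundaries, a per-file header check cannot run on this code path.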
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]