GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20894#discussion_r189062745
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -497,6 +498,11 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
     val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+      if (!parsedOptions.enforceSchema) {
+        CSVDataSource.checkHeader(firstLine, new CsvParser(parsedOptions.asParserSettings),
--- End diff ---
I mean we could, for example, create a dataset from spark.read.text("tmp/*.csv"), preprocess it, and then convert it via spark.read.csv(dataset). In that case, every file would contain its own header line, but this check doesn't validate each file's header.
Shall we document this if it's hard to fix?
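A minimal sketch of the scenario described above, assuming a running SparkSession named `spark` and several CSV files under tmp/*.csv, each starting with its own header row (paths and the filter step are illustrative, not from the PR):

```scala
// All files' lines are merged into a single Dataset[String],
// including every file's header row.
val lines: Dataset[String] = spark.read.textFile("tmp/*.csv")

// Some hypothetical preprocessing step.
val preprocessed = lines.filter(_.nonEmpty)

// When the merged dataset is parsed as CSV, only the first line can be
// compared against the schema; the header rows of the remaining files
// are treated as ordinary data rows and are never validated.
val df = spark.read
  .option("header", "true")
  .option("enforceSchema", "false")
  .csv(preprocessed)
```

Because spark.read.csv(csvDataset: Dataset[String]) sees a single stream of lines with no file boundaries, a per-file header check cannot run on this code path.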
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]