Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20894#discussion_r189103522
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -497,6 +498,11 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
      StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
    val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+      if (!parsedOptions.enforceSchema) {
+        CSVDataSource.checkHeader(firstLine, new CsvParser(parsedOptions.asParserSettings),
--- End diff ---
Thank you for explaining the use case. I didn't realize that the input
dataset could contain multiple headers. So the headers could potentially
contain column names in a different order, or even incorrect column names.
That case seems pretty tough to fix.
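To make the concern concrete, a header check of the general kind discussed here compares the parsed header row against the expected schema field names and flags mismatches. The sketch below is illustrative only, not Spark's actual `CSVDataSource.checkHeader` implementation; the object and method names are hypothetical:

```scala
// Hypothetical sketch of a header-vs-schema check (not Spark's actual code).
// Given a parsed header row and the schema's field names, report the first
// mismatch, if any. A file with multiple headers would only have its first
// header validated this way, which is the limitation discussed above.
object HeaderCheckSketch {
  def checkHeaderAgainstSchema(header: Seq[String], fieldNames: Seq[String]): Option[String] = {
    if (header.length != fieldNames.length) {
      // Column-count mismatch is reported before any per-column comparison.
      Some(s"CSV header has ${header.length} columns, but the schema has ${fieldNames.length}")
    } else {
      // Compare position by position; out-of-order or misnamed columns are caught here.
      header.zip(fieldNames).zipWithIndex.collectFirst {
        case ((h, f), i) if h != f =>
          s"Column $i: header name '$h' does not match schema field '$f'"
      }
    }
  }
}
```

Note that a check like this catches a reordered or misnamed first header, but headers buried later in the dataset (the multi-header case) would still pass through as ordinary data rows.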
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]