MaxGekk commented on issue #24757: [SPARK-27873][SQL] columnNameOfCorruptRecord 
should not be checked with column names in CSV header when disabling 
enforceSchema
URL: https://github.com/apache/spark/pull/24757#issuecomment-497744330
 
 
   I would prefer another solution: passing a schema without 
`columnNameOfCorruptRecord` to the checker. In that case, you don't need to 
modify the checker at all. For example, 
https://github.com/apache/spark/blob/02e9f933097107d870dba87cc03f6003af9b0efa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala#L49-L69
 should be changed to:
   ```Scala
       // Exclude the corrupt-record column before building the parser and the
       // header checker: that column never appears in the CSV header itself.
       val ds = StructType(dataSchema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
       val rs = StructType(readDataSchema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
       val parser = new UnivocityParser(ds, rs, parsedOptions)
       val schema = if (columnPruning) rs else ds
       val isStartOfFile = file.start == 0
       val headerChecker = new CSVHeaderChecker(
         schema, parsedOptions, source = s"CSV file: ${file.filePath}", isStartOfFile)
   ```
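   The filtering step works because Spark's `StructType` behaves like a
`Seq[StructField]`. A minimal self-contained sketch of the idea, using
plain-Scala stand-ins for `StructField`/`StructType` (these case classes and
the `_corrupt_record` default name are illustrative, not Spark's actual
definitions):
   ```Scala
   // Hypothetical stand-ins for Spark's StructField/StructType, for illustration only.
   case class StructField(name: String, dataType: String)
   case class StructType(fields: Seq[StructField]) {
     def filterNot(p: StructField => Boolean): Seq[StructField] = fields.filterNot(p)
   }

   object SchemaFilterSketch {
     def main(args: Array[String]): Unit = {
       val columnNameOfCorruptRecord = "_corrupt_record"
       val dataSchema = StructType(Seq(
         StructField("id", "int"),
         StructField("name", "string"),
         StructField(columnNameOfCorruptRecord, "string")))
       // Drop the corrupt-record column before header checking: the file's
       // header only contains real data columns, so the checker must not
       // expect the synthetic column to be present.
       val ds = StructType(dataSchema.filterNot(_.name == columnNameOfCorruptRecord))
       println(ds.fields.map(_.name).mkString(","))  // id,name
     }
   }
   ```
   With this shape, the header checker only ever sees column names that can
actually occur in the file, so its comparison logic needs no special case.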
