MaxGekk commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1467408202


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala:
##########
@@ -58,7 +58,7 @@ case class CSVPartitionReaderFactory(
       actualReadDataSchema,
       options,
       filters)
-    val schema = if (options.columnPruning) actualReadDataSchema else actualDataSchema
+    val schema = if (options.isColumnPruningEnabled) actualReadDataSchema else actualDataSchema

Review Comment:
   The `schema` is used only in `CSVHeaderChecker`, which is supposed to check the column names in the CSV header against the provided schema fields. From my point of view, it shouldn't depend on the column pruning feature at all.
   
   ```scala
     private def checkHeaderColumnNames(columnNames: Array[String]): Unit = {
   ...
         if (headerLen == schemaSize) {
   ...
         } else {
           errorMessage = Some(
            s"""|Number of column in CSV header is not equal to number of fields in the schema:
                 | Header length: $headerLen, schema size: $schemaSize
                 |$source""".stripMargin)
         }
   ```
   
   `schemaSize` must be the **full data schema** of the CSV file, not the required (pruned) schema.
   
   Let me re-think it, and avoid the dependency on column pruning in `CSVHeaderChecker`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
