HyukjinKwon commented on a change in pull request #35229:
URL: https://github.com/apache/spark/pull/35229#discussion_r786733726



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -434,7 +434,7 @@ case class DataSource(
           hs.partitionSchema,
           "in the partition schema",
           equality)
-        DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema)
+        DataSourceUtils.checkFieldType(hs.fileFormat, hs.dataSchema)

Review comment:
       It's not a guess. For the ORC case, it leverages the ORC schema parsing library. 
This schema validation logic comes from the Parquet logic as well.
   
   >  It also implies that Parquet has no limitation on the field name.
   
   If that's the case, why not just remove the restriction on field names in 
both the read and write paths?
   
   > comparing to the risk of not being able to read valid files and stopping 
users to use Spark.
   
   I see the risk of allowing this as worse, since it could potentially lead to corrupt data. 
   
   Can we narrow this fix to Parquet only for now?
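
   To illustrate the suggestion of narrowing the check, here is a minimal, hypothetical sketch (not Spark's actual `DataSourceUtils` API): field-name validation that only rejects special characters for Parquet, while other formats such as ORC fall through to their own schema parsers. The `FieldNameCheck` object and the exact character set are assumptions for illustration:

   ```scala
   // Hypothetical sketch: apply the special-character field-name restriction
   // to Parquet only; other formats (e.g. ORC) defer to their own parsers.
   object FieldNameCheck {
     // Characters assumed to be rejected in Parquet field names for this sketch.
     private val invalidChars: Set[Char] =
       Set(' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=')

     def isValidForFormat(format: String, fieldName: String): Boolean =
       format.toLowerCase match {
         case "parquet" => !fieldName.exists(invalidChars.contains)
         case _         => true // e.g. ORC: its schema parser decides
       }
   }
   ```

   Under this sketch, `FieldNameCheck.isValidForFormat("parquet", "a b")` would be rejected while the same name for ORC would pass through, which is the narrowed behavior being proposed.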




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
