cloud-fan commented on a change in pull request #35229:
URL: https://github.com/apache/spark/pull/35229#discussion_r786770474



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -434,7 +434,7 @@ case class DataSource(
           hs.partitionSchema,
           "in the partition schema",
           equality)
-        DataSourceUtils.verifySchema(hs.fileFormat, hs.dataSchema)
+        DataSourceUtils.checkFieldType(hs.fileFormat, hs.dataSchema)

Review comment:
       > For ORC case, it leverages ORC schema parsing library. This schema 
validation logic is from Parquet logic too.
   
   It seems ORC is the only case where we leverage the library to validate names 
and can do a 100% safe "fail fast". Parquet and Avro just write the checking code 
manually, apparently based on some unclear references.
   
   > If that's the case, why not just removing the restriction on field name in 
both read and write?
   
   I do want to propose it...
   
   > I see the risk of allowing this is worse with potentially some corrupt 
data.
   
   If Spark can read it, then it's not corrupted. There is nothing wrong with 
Spark following the format spec and using the standard library to read a file 
if every other tool can read this file as well.
   
   Anyway, there are only 3 sources implementing this name check: parquet, orc 
and avro. I'm ok to narrow this fix to parquet first, and fix orc and avro 
later.
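
   For illustration, a minimal sketch of the kind of hand-rolled field-name check 
the comment refers to (this is a hypothetical simplification, not Spark's actual 
`checkFieldType`/`verifySchema` code; the forbidden character set and the 
`FieldNameCheck` object name are assumptions for the example):

```scala
// Hypothetical sketch of a manual field-name check, in the style that
// Parquet/Avro-like sources hand-roll, as opposed to ORC delegating
// validation to its own schema parsing library.
object FieldNameCheck {
  // Assumed set of characters rejected in field names; real formats
  // differ in exactly which characters they restrict.
  private val forbidden: Set[Char] = " ,;{}()\n\t=".toSet

  def check(fieldNames: Seq[String]): Unit = {
    fieldNames.foreach { name =>
      if (name.exists(forbidden.contains)) {
        throw new IllegalArgumentException(
          s"Column name '$name' contains invalid character(s). " +
            "Please use an alias to rename it.")
      }
    }
  }
}
```

   A check like this "fails fast" at write time, but, as the comment argues, it 
can also reject files that the underlying format and its standard library would 
happily read.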




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


