Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797

>> Like you said, users can still create a Hive table with mixed-case-schema Parquet/ORC files, by Hive or other systems like Presto. This table is readable for Hive, and for Spark prior to 2.1, because of the runtime schema inference. But this is not intentional, and Spark should not support it, as the data file schema and table schema mismatch.
>
> I will continue to argue strongly against reducing the number of use cases Spark SQL supports out of the box. While a migration command can be a helpful optimization, I don't think it is acceptable as the only option, for the reasons I've detailed here.
>
> Simply put, I think relying on the presence of Spark-specific key/value pairs in the table properties in order for Spark SQL to function properly, and assuming that Spark (or Spark users) can easily alter those properties to add the table schema, is too brittle for large-scale production use.

I would have to agree with @budde in this case. In versions of Spark prior to 2.1, an effort was made to reconcile metastore and file-format case mismatches using the method `ParquetFileFormat.mergeMetastoreParquetSchema`. The code documentation for that method is here: https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719. I don't see anything there that suggests this was a "hack" or was intended to be removed in a later version. It seems we've simply broken compatibility with a certain class of Hive tables in Spark 2.1.

Schema inference is very expensive, and doing it at query time on large tables was painful in versions prior to Spark 2.1 because all metadata files were read. But it seems some people were using it nonetheless and found it useful. At least in Spark 2.1, only the files for the partitions read in a query would be read for schema inference. That would significantly improve query-time schema inference performance for partitioned tables.

Incidentally, what happens when a program outside of Spark (such as Hive) updates the Hive metastore schema of a table that has the embedded Spark SQL schema? Does Spark detect that change and update the embedded schema? Does it have to redo the schema inference across all files in the table?
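
For concreteness, the reconciliation I have in mind is roughly of the following shape: match metastore fields to file fields case-insensitively, but keep the exact-case names from the Parquet file schema. This is only a simplified sketch of the idea, not the actual `mergeMetastoreParquetSchema` implementation, and `reconcileSchemas` is just a hypothetical helper name:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch only: reconcile a (typically lower-cased) metastore schema with the
// mixed-case schema read from Parquet footers. Field order comes from the
// metastore schema; field names and types come from the matching Parquet field.
def reconcileSchemas(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  // Index Parquet fields by lower-cased name for case-insensitive lookup.
  val parquetFieldsByLowerName: Map[String, StructField] =
    parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap

  val reconciled = metastoreSchema.fields.map { metastoreField =>
    parquetFieldsByLowerName.get(metastoreField.name.toLowerCase) match {
      // Prefer the Parquet field, which carries the original mixed-case name.
      case Some(parquetField) => parquetField.copy(nullable = metastoreField.nullable)
      // Field present only in the metastore (e.g. a partition column): keep it as-is.
      case None => metastoreField
    }
  }
  StructType(reconciled)
}
```

Whether this kind of reconciliation happens at query time (as it did before 2.1) or is baked once into table properties is exactly the trade-off under discussion here.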