[ https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673763#comment-16673763 ]
Adam Budde commented on SPARK-25925:
------------------------------------

It's been a while since I've thought about all this, so apologies if there are any inaccuracies in my reply, but I'll try to add some more context here.

Prior to 2.1, Spark had to inspect Parquet files in all cases to infer a case-sensitive schema. Using only the downcased schema returned by the Hive Metastore can cause all sorts of silent issues, most notably with predicate pushdown: a filter on a field whose case doesn't match the actual field name in the Parquet file will return 0 records in all cases.

In 2.1, a feature was added where Spark encodes the case-sensitive schema as JSON and stores it as a Hive table property in order to eliminate the inference step. However, this change removed the inference capability entirely and simply fell back on the downcased Metastore schema if the case-sensitive schema was not found in the table properties. As a result, any Hive table backed by case-sensitive Parquet files broke for the above reasons unless it was specifically created using Spark SQL 2.1 or above. This broke compatibility with hundreds of tables I had to support using Spark SQL, which is why I opened SPARK-19611 and why spark.sql.hive.caseSensitiveInferenceMode was introduced.

I believe the original default value was NEVER_INFER, for at least the 2.1.x release series, in order to maintain the same behavior as Spark 2.1.0. I know there was some discussion around changing the default to INFER_AND_SAVE, and it looks like that's the case today. This was intended as a sensible middle ground: favor inferring the schema when needed to ensure correct results in all cases (despite the performance hit when all field names are already lowercase), but save the result in the table properties, as we would when creating a new table, so the schema doesn't have to be inferred again in the future.
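To make the pushdown failure mode concrete, here is a minimal sketch in plain Python (a simulation, not Spark or Parquet internals): Parquet field names are case sensitive, so a predicate built from the downcased Metastore schema simply never matches the real column.

```python
# Simulated Parquet "row group" whose real field name is mixed case.
parquet_rows = [{"UserId": 1}, {"UserId": 2}, {"UserId": 3}]

def pushdown_filter(rows, field, predicate):
    # Case-sensitive field lookup, as a Parquet reader would do: a row
    # without the exact field name contributes nothing to the result.
    return [r for r in rows if field in r and predicate(r[field])]

# Filtering with the case-sensitive name matches every row...
assert len(pushdown_filter(parquet_rows, "UserId", lambda v: v >= 1)) == 3

# ...but the downcased Hive Metastore name silently matches nothing.
assert pushdown_filter(parquet_rows, "userid", lambda v: v >= 1) == []
```

This is the silent zero-record behavior described above: no error is raised, the filter just drops everything.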
I think the NEVER_INFER option exists specifically for your use case: the case-sensitive schema isn't stored in the Hive table properties, but you'd rather just use the Metastore schema directly instead of inferring one, since the schema contains no case-sensitive column names. I still think that INFER_AND_SAVE is the better default here, as it prioritizes correct results for one set of use cases over better performance for a different set. To be honest, though, I don't care so much what the default is as long as the behavior is configurable. It is admittedly a bit of a crufty option to present users with, but I don't think there's a good way around it given the constraints. A cleaner solution would be a mechanism for configuring case sensitivity at the Parquet library level. Parquet has had some Jira issues opened around this, as well as a [PR|https://github.com/apache/parquet-mr/pull/210], but it seems like they've all died on the vine.

> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
>                 Key: SPARK-25925
>                 URL: https://issues.apache.org/jira/browse/SPARK-25925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alex Ivanov
>            Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate side-effect. I realize it's set to INFER_AND_SAVE for a reason, namely https://issues.apache.org/jira/browse/SPARK-19611, however that also causes an issue.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all partitions of the table to be retrieved from the Hive Metastore. This is a problem because even running *explain* on a simple query against a table with thousands of partitions seems to hang, and it is very difficult to debug.
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> seeing that it works, and calling it a day, thereby forgoing the benefits of using Spark's Parquet support directly. In our experience, this causes significant slow-downs on at least some queries.
> This Jira is mostly to document the issue, even if it cannot be addressed, so that people who inevitably run into this behavior can see the resolution, which is changing the parameter to *NEVER_INFER*, provided there are no issues with Parquet-Hive schema compatibility, i.e. all of the schema is lower-case.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
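For reference, the resolution described above amounts to changing the last of the three default properties quoted earlier in _spark-defaults.conf_ (this sketch assumes, as the issue states, that the table schemas contain no upper-case column names):

```properties
# Keep native Parquet support and Metastore partition pruning enabled.
spark.sql.hive.convertMetastoreParquet     true
spark.sql.hive.metastorePartitionPruning   true
# Skip schema inference (and the full partition listing it triggers);
# safe only when the Metastore schema is already entirely lower-case.
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```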