[ https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673763#comment-16673763 ]
Adam Budde commented on SPARK-25925:
------------------------------------

It's been a while since I've thought about all this, so apologies if there are any inaccuracies in my reply, but I'll try to add some more context here.

Prior to 2.1, Spark had to inspect Parquet files in all cases to infer a case-sensitive schema. Using only the downcased schema returned by the Hive Metastore can cause all sorts of silent issues, most notably with predicate pushdown: a filter on a field whose case doesn't match the actual field name in the Parquet file will return 0 records in all cases.

In 2.1, a feature was added where Spark encodes the case-sensitive schema as JSON and stores it as a Hive table property in order to eliminate the inference step. However, this change removed the inference capability entirely and simply fell back on the downcased Metastore schema if the case-sensitive schema was not found in the table properties. As a result, any Hive table backed by case-sensitive Parquet files broke for the above reasons unless it was specifically created using Spark SQL 2.1 or above. This broke compatibility with hundreds of tables I had to support using Spark SQL, which is why I opened SPARK-19611 and why spark.sql.hive.caseSensitiveInferenceMode was introduced.

I believe the original default value was NEVER_INFER, for at least the 2.1.x release series, in order to maintain the same behavior as Spark 2.1.0. I know there was some discussion around changing the default to INFER_AND_SAVE, and it looks like that's the case today. This was intended as a sensible middle ground: favor inferring the schema when needed to ensure correct results in all cases (despite the performance hit when all field names are already lowercase), but save the result in the table properties, as we would when creating a new table, so the schema doesn't have to be inferred again in the future.
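To make the pushdown failure mode concrete, here is a minimal sketch in plain Python (a simulation, not Spark or Parquet internals): Parquet field names are case sensitive, so a predicate built from the downcased Metastore schema simply never matches the real column.

```python
# Simulated Parquet "row group" whose real field name is mixed case.
parquet_rows = [{"UserId": 1}, {"UserId": 2}, {"UserId": 3}]

def pushdown_filter(rows, field, predicate):
    # Case-sensitive field lookup, as a Parquet reader would do: a row
    # without the exact field name contributes nothing to the result.
    return [r for r in rows if field in r and predicate(r[field])]

# Filtering with the case-sensitive name matches every row...
assert len(pushdown_filter(parquet_rows, "UserId", lambda v: v >= 1)) == 3

# ...but the downcased Hive Metastore name silently matches nothing.
assert pushdown_filter(parquet_rows, "userid", lambda v: v >= 1) == []
```

This is the silent zero-record behavior described above: no error is raised, the filter just drops everything.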
I think the NEVER_INFER option exists specifically for your use case: the case-sensitive schema isn't stored in the Hive table properties, but you'd rather just use the Metastore schema directly instead of inferring one, since the schema contains no case-sensitive column names. I still think that INFER_AND_SAVE is the better default here, as it prioritizes correct results for one set of use cases over better performance for a different set. To be honest, though, I don't care so much what the default is as long as the behavior is configurable. It is admittedly a bit of a crufty option to present users with, but I don't think there's a good way around it given the constraints. A cleaner solution would be a mechanism for configuring case sensitivity at the Parquet library level. Parquet has had some Jira issues opened around this, as well as a [PR|https://github.com/apache/parquet-mr/pull/210], but it seems like they've all died on the vine.

> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
>                 Key: SPARK-25925
>                 URL: https://issues.apache.org/jira/browse/SPARK-25925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alex Ivanov
>            Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate side-effect. I realize it's set to INFER_AND_SAVE for a reason, namely https://issues.apache.org/jira/browse/SPARK-19611, however that also causes an issue.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all partitions of the table to be retrieved from the Hive Metastore. This is a problem because even running *explain* on a simple query against a table with thousands of partitions seems to hang, and it is very difficult to debug.
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> seeing that it works, and calling it a day, thereby forgoing the benefits of using Spark's Parquet support directly. In our experience, this causes significant slow-downs on at least some queries.
> This Jira is mostly to document the issue, even if it cannot be addressed, so that people who inevitably run into this behavior can see the resolution, which is changing the parameter to *NEVER_INFER*, provided there are no issues with Parquet-Hive schema compatibility, i.e. all of the schema is lower-case.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
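For reference, the resolution described above amounts to changing the last of the three default properties quoted earlier in _spark-defaults.conf_ (this sketch assumes, as the issue states, that the table schemas contain no upper-case column names):

```properties
# Keep native Parquet support and Metastore partition pruning enabled.
spark.sql.hive.convertMetastoreParquet     true
spark.sql.hive.metastorePartitionPruning   true
# Skip schema inference (and the full partition listing it triggers);
# safe only when the Metastore schema is already entirely lower-case.
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```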