[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226640#comment-16226640 ]

Wenchen Fan commented on SPARK-22306:
-------------------------------------

This is a known issue before Spark 2.3: an ALTER TABLE on the Spark side erases 
the bucketing information of a Hive table. In this particular case, however, the 
ALTER TABLE is triggered automatically, which makes the bug especially bad for 
users.

I'm going to handle this case specially, and I suggest users upgrade to Spark 
2.3.

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22306
>                 URL: https://issues.apache.org/jira/browse/SPARK-22306
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>            Reporter: David Malinge
>            Priority: Critical
>
> I noticed some critical changes to my Hive tables and realized that they were 
> caused by a simple SELECT in Spark SQL. Looking at the logs, I found out that 
> this SELECT was actually performing an update on the metastore database 
> ("Saving case-sensitive schema for table").
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
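> A minimal sketch of that workaround (the exact invocation shown here is an 
> example, not from the original report): set the property before the table is 
> first resolved, e.g. when starting the shell:
> {code}
> # assumption: any mechanism that sets the conf before the first query
> # against the table works; spark-defaults.conf would do as well
> spark-shell --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER
> scala> spark.conf.get("spark.sql.hive.caseSensitiveInferenceMode")
> res0: String = NEVER_INFER
> scala> sql("SELECT * FROM t").show(false)  // no inferred schema written back to the metastore
> {code}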
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
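> Applied to the reproduction table from the REPRODUCE section below, a 
> hypothetical instance of that repair would be:
> {code:sql}
> -- restores the bucketing/sorting metadata that table t (see below) was created with
> ALTER TABLE t CLUSTERED BY (a, b) SORTED BY (a, b) INTO 10 BUCKETS;
> {code}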
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. The Spark 2.3 (master) branch 
> is unaffected thanks to SPARK-17729, so this is a regression in Spark 2.2 only. 
> By default, Parquet Hive tables are affected, and only Hive (not Spark itself) 
> suffers from the lost metadata.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:          10
> Bucket Columns:       [a, b]
> Sort Columns:         [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:          -1
> Bucket Columns:       []
> Sort Columns:         []
> {code}


