GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/18849
[SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore.
HiveExternalCatalog.alterTableSchema takes a shortcut by modifying the raw
Hive table metadata instead of the full Spark view; that means it needs to
be aware of whether the table is Hive-compatible or not.
For compatible tables, the existing "replace the schema" code is the correct
path, except that an exception on that path should surface as an error rather
than trigger a retry through a different code path.
For non-compatible tables, Spark should just update the table properties,
and leave the schema stored in the raw table untouched.
Because Spark doesn't explicitly store metadata about whether a table is
Hive-compatible, a new table property was added to make that explicit. The
code also tries to detect old DS tables that lack the property and handle
them correctly.
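For illustration, here is a minimal Scala sketch of the intended branching. It
is not the actual HiveExternalCatalog code; the property name, the simplified
types, and the encoding helper below are assumptions made for the sketch:

    // Illustrative only; not the actual HiveExternalCatalog implementation.
    object AlterSchemaSketch {
      // Simplified stand-in for the raw Hive table: Hive-level (name, type)
      // columns plus the table properties where Spark serializes its schema.
      case class RawHiveTable(
          schema: Seq[(String, String)],
          properties: Map[String, String])

      // Hypothetical property marking whether the table was stored
      // Hive-compatibly; not the real property name.
      val HiveCompatibleKey = "hiveCompatible"

      // Placeholder encoding; Spark's real property layout for DS schemas
      // differs.
      private def encodeSparkSchema(schema: Seq[(String, String)]): Map[String, String] =
        Map("spark.schema" -> schema.map { case (n, t) => s"$n $t" }.mkString(", "))

      def alterTableSchema(raw: RawHiveTable, newSchema: Seq[(String, String)]): RawHiveTable = {
        // Old DS tables may lack the property; the real change adds detection
        // for them, which this sketch elides.
        val hiveCompatible = raw.properties.get(HiveCompatibleKey).exists(_.toBoolean)
        if (hiveCompatible) {
          // Compatible table: replace the Hive-level schema. A failure here
          // should surface as an error rather than fall back to another path.
          raw.copy(schema = newSchema)
        } else {
          // Non-compatible table: leave the raw Hive schema untouched and only
          // refresh the Spark schema stored in table properties.
          raw.copy(properties = raw.properties ++ encodeSparkSchema(newSchema))
        }
      }
    }

The key point the sketch tries to capture is that the non-compatible branch
never touches the Hive-level columns, only the serialized Spark schema in the
table properties.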
These changes also uncovered a problem with the way case-sensitive DS tables
were being saved to the Hive metastore: the metastore is case-insensitive,
and the code was treating these tables as Hive-compatible whenever the data
source had a Hive counterpart (e.g. Parquet). In that scenario, the schema
could be corrupted when updated from Spark if column names conflicted once
case was ignored. The change fixes this by treating case-sensitive DS tables
as not Hive-compatible.
Tested with existing and added unit tests (plus internal tests with a 2.1
metastore).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-21617
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18849.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18849
----
commit aae3abd673adc7ff939d842e49d566fa722403a3
Author: Marcelo Vanzin <[email protected]>
Date: 2017-08-02T21:47:34Z
[SPARK-21617][SQL] Store correct metadata in Hive for altered DS table.
This change fixes two issues:
- when loading table metadata from Hive, restore the "provider" field of
CatalogTable so DS tables can be identified.
- when altering a DS table in the Hive metastore, make sure to not alter
the table's schema, since the DS table's schema is stored as a table
property in those cases.
Also added a new unit test for this issue which fails without this change.
commit 2350b105a599dde849e44bde50aa6d13812e4f83
Author: Marcelo Vanzin <[email protected]>
Date: 2017-08-04T22:49:31Z
Fix 2.1 DDL suite to not use SparkSession.
commit 7ccf4743024a8a447a4b05369f6ebf237cf88c4f
Author: Marcelo Vanzin <[email protected]>
Date: 2017-08-04T22:57:44Z
Proper fix.
HiveExternalCatalog.alterTableSchema takes a shortcut by modifying the raw
Hive table metadata instead of the full Spark view; that means it needs to
be aware of whether the table is Hive-compatible or not.
For compatible tables, the existing "replace the schema" code is the correct
path, except that an exception on that path should surface as an error rather
than trigger a retry through a different code path.
For non-compatible tables, Spark should just update the table properties,
and leave the schema stored in the raw table untouched.
Because Spark doesn't explicitly store metadata about whether a table is
Hive-compatible, a new table property was added to make that explicit. The
code also tries to detect old DS tables that lack the property and handle
them correctly.
These changes also uncovered a problem with the way case-sensitive DS tables
were being saved to the Hive metastore: the metastore is case-insensitive,
and the code was treating these tables as Hive-compatible whenever the data
source had a Hive counterpart (e.g. Parquet). In that scenario, the schema
could be corrupted when updated from Spark if column names conflicted once
case was ignored. The change fixes this by treating case-sensitive DS tables
as not Hive-compatible.
----