GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/18849

    [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore.

    HiveExternalCatalog.alterTableSchema takes a shortcut by modifying the raw
    Hive table metadata instead of the full Spark view; that means it needs to
    be aware of whether the table is Hive-compatible or not.
    
    For compatible tables, the current "replace the schema" code is the correct
    path, except that an exception on that path should surface as an error
    instead of triggering a retry through a different code path.
    
    For non-compatible tables, Spark should just update the table properties,
    and leave the schema stored in the raw table untouched.
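    
    As a rough illustration of the two paths, here is a minimal sketch;
    RawTable, isHiveCompatible, and schemaToProperties are hypothetical
    stand-ins for the real Spark internals, and the property encoding is
    an assumption, not what the PR actually stores:
    
        import org.apache.spark.sql.types.StructType
    
        // Hypothetical raw-metastore view of a table: a Hive-side schema
        // plus free-form table properties.
        case class RawTable(schema: StructType, properties: Map[String, String])
    
        // Illustrative encodings; both property names are assumptions.
        def isHiveCompatible(raw: RawTable): Boolean =
          raw.properties.get("hiveCompatible").contains("true")
    
        def schemaToProperties(schema: StructType): Map[String, String] =
          Map("sparkSchemaJson" -> schema.json)
    
        def alterTableSchema(
            raw: RawTable,
            newSchema: StructType,
            alterInMetastore: RawTable => Unit): Unit = {
          if (isHiveCompatible(raw)) {
            // Compatible: replace the schema in the raw metadata directly.
            // A failure here should surface as an error, not a fallback.
            alterInMetastore(raw.copy(schema = newSchema))
          } else {
            // Not compatible: the real schema lives in table properties, so
            // update only those and leave the raw Hive schema untouched.
            alterInMetastore(raw.copy(
              properties = raw.properties ++ schemaToProperties(newSchema)))
          }
        }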
    
    Because Spark doesn't explicitly store metadata about whether a table is
    Hive-compatible, a new property was added to make that explicit. The code
    also tries to detect old DS tables that lack the property and handle them
    correctly.
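    
    For old tables created before this property existed, a fallback check is
    needed. A hedged guess at that logic, reusing the RawTable sketch above
    (the exact detection in the PR may differ):
    
        // Old DS tables predate the explicit flag, so compatibility has to
        // be inferred; here, by checking whether the raw Hive schema mirrors
        // the Spark schema. This check is an illustrative assumption.
        def isCompatibleWithFallback(
            raw: RawTable,
            sparkSchema: StructType): Boolean =
          raw.properties.get("hiveCompatible") match {
            case Some(flag) => flag == "true"      // new tables carry the flag
            case None => raw.schema == sparkSchema // old tables: infer it
          }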
    
    These changes also uncovered a problem with the way case-sensitive DS tables
    were being saved to the Hive metastore; the metastore is case-insensitive,
    and the code was treating these tables as Hive-compatible whenever the data
    source had a Hive counterpart (e.g. Parquet). In that scenario, a schema
    update from Spark could corrupt the schema if the table contained columns
    whose names conflicted when compared case-insensitively. The change fixes
    this by making case-sensitive DS tables not Hive-compatible.
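    
    For concreteness, a table like the following triggers that scenario (the
    session setup is a minimal sketch, not the PR's test, and the table name
    is arbitrary):
    
        import org.apache.spark.sql.SparkSession
    
        val spark = SparkSession.builder()
          .appName("case-sensitive-ds-table-demo")
          .config("spark.sql.caseSensitive", "true") // case-sensitive analysis
          .enableHiveSupport()
          .getOrCreate()
    
        // Two columns that differ only in case: legal for Spark when analysis
        // is case-sensitive, but ambiguous to Hive's case-insensitive
        // metastore if the table is stored as Hive-compatible.
        spark.range(1).selectExpr("id AS col", "id AS COL")
          .write.format("parquet").saveAsTable("case_demo")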
    
    Tested with existing and added unit tests (plus internal tests with a 2.1
    metastore).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-21617

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18849.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18849
    
----
commit aae3abd673adc7ff939d842e49d566fa722403a3
Author: Marcelo Vanzin <[email protected]>
Date:   2017-08-02T21:47:34Z

    [SPARK-21617][SQL] Store correct metadata in Hive for altered DS table.
    
    This change fixes two issues:
    - when loading table metadata from Hive, restore the "provider" field of
      CatalogTable so DS tables can be identified.
    - when altering a DS table in the Hive metastore, make sure to not alter
      the table's schema, since the DS table's schema is stored as a table
      property in those cases.
    
    Also added a new unit test for this issue; it fails without this change.

commit 2350b105a599dde849e44bde50aa6d13812e4f83
Author: Marcelo Vanzin <[email protected]>
Date:   2017-08-04T22:49:31Z

    Fix 2.1 DDL suite to not use SparkSession.

commit 7ccf4743024a8a447a4b05369f6ebf237cf88c4f
Author: Marcelo Vanzin <[email protected]>
Date:   2017-08-04T22:57:44Z

    Proper fix.
    
    HiveExternalCatalog.alterTableSchema takes a shortcut by modifying the raw
    Hive table metadata instead of the full Spark view; that means it needs to
    be aware of whether the table is Hive-compatible or not.
    
    For compatible tables, the current "replace the schema" code is the correct
    path, except that an exception on that path should surface as an error
    instead of triggering a retry through a different code path.
    
    For non-compatible tables, Spark should just update the table properties,
    and leave the schema stored in the raw table untouched.
    
    Because Spark doesn't explicitly store metadata about whether a table is
    Hive-compatible, a new property was added to make that explicit. The code
    also tries to detect old DS tables that lack the property and handle them
    correctly.
    
    These changes also uncovered a problem with the way case-sensitive DS tables
    were being saved to the Hive metastore; the metastore is case-insensitive,
    and the code was treating these tables as Hive-compatible whenever the data
    source had a Hive counterpart (e.g. Parquet). In that scenario, a schema
    update from Spark could corrupt the schema if the table contained columns
    whose names conflicted when compared case-insensitively. The change fixes
    this by making case-sensitive DS tables not Hive-compatible.

----

