GitHub user rxin opened a pull request:
https://github.com/apache/spark/pull/11293
[SPARK-13080] [SQL] Implement new Catalog API using Hive
## What changes were proposed in this pull request?
This is a step towards merging `SQLContext` and `HiveContext`. A new
internal Catalog API was introduced in #10982 and extended in #11069. This
patch introduces an implementation of this API using `HiveClient`, an existing
interface to Hive. It also extends `HiveClient` with additional calls to Hive
that are needed to complete the catalog implementation.
*Where should I start reviewing?* The new catalog introduced is
`HiveCatalog`. This class is relatively simple because it just calls
`HiveClientImpl`, where most of the new logic is. I would not start with
`HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly
because of a refactor.
*Why is this patch so big?* I had to refactor HiveClient to remove an
intermediate representation of databases, tables, partitions, etc. After this
refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.).
Otherwise we would have to first convert `CatalogTable` to the intermediate
representation and then convert that to `HiveTable`, which is messy.
The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
- org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
- org.apache.spark.sql.hive.HiveCatalog
```
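As a rough illustration of this hierarchy, the sketch below shows one interface with an in-memory implementation and a second implementation that only delegates to a metastore client. It is not the code in this patch: `MetastoreClient`, `HiveBackedCatalog`, the trimmed-down `CatalogDatabase`, and all method signatures are assumptions chosen to keep the example small.

```scala
import scala.collection.mutable

// Hypothetical, trimmed-down database descriptor (not the real class).
case class CatalogDatabase(name: String, description: String)

// Hypothetical subset of the catalog interface.
trait Catalog {
  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def databaseExists(name: String): Boolean
  def listDatabases(): Seq[String]
}

// Keeps all metadata in a local map; useful for tests.
class InMemoryCatalog extends Catalog {
  private val dbs = mutable.LinkedHashMap.empty[String, CatalogDatabase]

  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit = {
    if (!dbs.contains(db.name)) {
      dbs(db.name) = db
    } else if (!ignoreIfExists) {
      throw new IllegalArgumentException(s"Database ${db.name} already exists")
    }
  }

  override def databaseExists(name: String): Boolean = dbs.contains(name)

  override def listDatabases(): Seq[String] = dbs.keys.toSeq
}

// Hypothetical stand-in for the client interface that talks to Hive.
trait MetastoreClient {
  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def getDatabaseOption(name: String): Option[CatalogDatabase]
  def listDatabases(pattern: String): Seq[String]
}

// Thin wrapper: every call is forwarded to the client, so the
// interesting logic lives in the client implementation.
class HiveBackedCatalog(client: MetastoreClient) extends Catalog {
  override def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit =
    client.createDatabase(db, ignoreIfExists)

  override def databaseExists(name: String): Boolean =
    client.getDatabaseOption(name).isDefined

  override def listDatabases(): Seq[String] = client.listDatabases("*")
}
```

Because both classes implement the same trait, a shared test suite can in principle be run against either one, which is the idea behind having `HiveCatalogSuite` extend `CatalogTestCases`.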
Note that, as of this patch, none of these classes are used anywhere yet.
That will come in a future patch before the Spark 2.0 release.
## How was this patch tested?
All existing unit tests, and HiveCatalogSuite that extends CatalogTestCases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark hive-catalog
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11293.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11293
----
commit 3b6660578f23c69abfb59fae6796ee10bf4d482d
Author: Andrew Or <[email protected]>
Date: 2016-02-10T21:16:30Z
Add skeleton for HiveCatalog
commit f3e094ad21bd38d400f90b93898995182a508e9b
Author: Andrew Or <[email protected]>
Date: 2016-02-10T21:34:36Z
Implement createDatabase
commit 4b09a7da8ddcc17a813e494d868a6ea55f01cd2e
Author: Andrew Or <[email protected]>
Date: 2016-02-10T21:48:00Z
Fix style
commit 526f278d78664c49572fd1b48495ca99d12d1896
Author: Andrew Or <[email protected]>
Date: 2016-02-10T21:59:02Z
Implement dropDatabase
commit 4aa6e66b5ee9fa2e5f8e4b9955ed98de5b35a57c
Author: Andrew Or <[email protected]>
Date: 2016-02-10T22:06:08Z
Implement alterDatabase
commit 433d180260c57a905e226f0b8686eeb92d5dc938
Author: Andrew Or <[email protected]>
Date: 2016-02-10T22:14:15Z
Implement getDatabase, listDatabases and databaseExists
commit ff5c5bea8d4d84ae56acd4caf225e59231b946ba
Author: Andrew Or <[email protected]>
Date: 2016-02-10T23:18:53Z
Implement createTable
This required converting o.a.s.sql.catalyst.catalog.Table to its
counterpart in o.a.s.sql.hive.client.HiveTable. This required
making o.a.s.sql.hive.client.TableType an enum because we need
to create one of these from name.
commit ff49f0cf6fabc645121b43b5746017c838a3551d
Author: Andrew Or <[email protected]>
Date: 2016-02-10T23:22:38Z
Explicitly mark methods with override in HiveCatalog
commit ca98c00264564717ddd427282bfff301ebdb6c70
Author: Andrew Or <[email protected]>
Date: 2016-02-10T23:25:27Z
Implement dropTable
commit 71f99646cdf30a68a8e592b80ef5a6f40685551b
Author: Andrew Or <[email protected]>
Date: 2016-02-10T23:40:37Z
Implement renameTable, alterTable
commit 13795d83c325a69fb35260c300b379e2e55725aa
Author: Andrew Or <[email protected]>
Date: 2016-02-12T00:51:36Z
Remove intermediate representation of tables, columns etc.
Currently there's the catalog table, the Spark table used in the
hive module, and the Hive table. To avoid converting back and forth
between these table representations, we kill the intermediate one,
which is the one currently used throughout HiveClient and friends.
commit af5ffc0ee84f3dc3c2b9249228293ae7285f916e
Author: Andrew Or <[email protected]>
Date: 2016-02-12T01:34:24Z
Remove TableType enum
Instead, this commit introduces CatalogTableType that serves
the same purpose. This adds some type-safety and keeps the code
clean.
commit d7b18e628374659f0a792d5c5a9154711fc9073b
Author: Andrew Or <[email protected]>
Date: 2016-02-12T01:48:30Z
Re-implement all table operations after the refactor
commit a915d01eac651994c4d69b961299b476fe40f77d
Author: Andrew Or <[email protected]>
Date: 2016-02-12T20:50:39Z
Implement all partition operations
commit 3ceb88d51a6e6af92cff2e90622ba235d0d107e9
Author: Andrew Or <[email protected]>
Date: 2016-02-12T22:04:45Z
Implement all function operations
commit 07332ad6803e578d9a61cc4693d8ce665ad8c29a
Author: Andrew Or <[email protected]>
Date: 2016-02-12T22:10:33Z
Simplify alterDatabase
The operation doesn't support renaming anyway, so it doesn't
make sense to pass in a name AND a CatalogDatabase that always
has the same name.
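
A minimal sketch of the signature change described in this commit, assuming a simplified `CatalogDatabase` and a hypothetical `DatabaseStore` class in place of the real catalog:

```scala
import scala.collection.mutable

// Hypothetical, trimmed-down database descriptor.
case class CatalogDatabase(name: String, description: String)

// Hypothetical store standing in for a catalog implementation.
class DatabaseStore {
  private val dbs = mutable.Map.empty[String, CatalogDatabase]

  def createDatabase(db: CatalogDatabase): Unit = { dbs(db.name) = db }

  def getDatabase(name: String): CatalogDatabase = dbs(name)

  // Earlier shape (assumed): alterDatabase(name: String, db: CatalogDatabase),
  // where `name` always had to equal `db.name`.
  // Simplified shape: the definition alone identifies the database.
  def alterDatabase(db: CatalogDatabase): Unit = {
    require(dbs.contains(db.name), s"Database ${db.name} does not exist")
    dbs(db.name) = db
  }
}
```

With a single argument there is no way to pass a name that disagrees with the definition.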
commit cdf1f70479a6ac588249cea221b602e07d936892
Author: Andrew Or <[email protected]>
Date: 2016-02-12T22:15:55Z
Clean up HiveClientImpl a little
commit bbb81701602f97b5df43f074e33ab2a1d261926c
Author: Andrew Or <[email protected]>
Date: 2016-02-12T23:06:12Z
Merge branch 'master' of github.com:apache/spark into hive-catalog
commit 2b720256a319c9f9709801cb690f61cf1dbd0ace
Author: Andrew Or <[email protected]>
Date: 2016-02-12T23:13:01Z
Fix tests?
commit 5e2cd3afe77333ee586cb0fdfe962856b1ba2e84
Author: Andrew Or <[email protected]>
Date: 2016-02-12T23:54:32Z
Miscellaneous cleanup
commit 6519c2a8bf5e4dc8067bedad86e04a4cef0bc24f
Author: Andrew Or <[email protected]>
Date: 2016-02-16T19:03:53Z
Merge branch 'master' of github.com:apache/spark into hive-catalog
commit 7d58fac540694f21279f221b4fae489c6b4d1933
Author: Andrew Or <[email protected]>
Date: 2016-02-16T22:17:15Z
Address comments + minor cleanups
commit 1c05b9b3ce677a62062f1d90f861b20398ab42a4
Author: Andrew Or <[email protected]>
Date: 2016-02-16T22:33:13Z
Fix wrong Hive TableType issue
We used to pass CatalogTableType#toString into HiveTable, which
fails later when Hive extracts the Java enum value from the
string. This was the cause of test failures in a few test suites:
- InsertIntoHiveTableSuite
- MultiDatabaseSuite
- ParquetMetastoreSuite
- ...
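
The failure mode can be sketched as follows. This is an illustration under assumptions, not the code in this patch: `HiveTableType` is a Scala `Enumeration` standing in for Hive's Java enum, the `CatalogTableType` shown here is a hypothetical case class, and `TableTypeConversions` is an invented helper.

```scala
import scala.util.Try

// Stand-in for Hive's Java TableType enum; lookup by name fails
// unless the string is exactly one of the constant names.
object HiveTableType extends Enumeration {
  val MANAGED_TABLE, EXTERNAL_TABLE, INDEX_TABLE, VIRTUAL_VIEW = Value
}

// Hypothetical catalog-side type. As a case class, its default
// toString is "CatalogTableType(EXTERNAL_TABLE)", not "EXTERNAL_TABLE".
final case class CatalogTableType(name: String)

object CatalogTableType {
  val ExternalTable = CatalogTableType("EXTERNAL_TABLE")
  val ManagedTable = CatalogTableType("MANAGED_TABLE")
  val IndexTable = CatalogTableType("INDEX_TABLE")
  val VirtualView = CatalogTableType("VIRTUAL_VIEW")
}

object TableTypeConversions {
  // Explicit mapping, so nothing depends on how toString is rendered.
  def toHiveTableType(t: CatalogTableType): HiveTableType.Value = t match {
    case CatalogTableType.ExternalTable => HiveTableType.EXTERNAL_TABLE
    case CatalogTableType.ManagedTable => HiveTableType.MANAGED_TABLE
    case CatalogTableType.IndexTable => HiveTableType.INDEX_TABLE
    case CatalogTableType.VirtualView => HiveTableType.VIRTUAL_VIEW
    case other => throw new IllegalArgumentException(s"Unknown table type: $other")
  }
}

object TableTypeDemo {
  // Passing toString through a by-name lookup fails at runtime.
  val viaToString: Try[HiveTableType.Value] =
    Try(HiveTableType.withName(CatalogTableType.ExternalTable.toString))

  // The explicit mapping yields the intended enum value.
  val viaMapping: HiveTableType.Value =
    TableTypeConversions.toHiveTableType(CatalogTableType.ExternalTable)
}
```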
commit 4ecc3b1245998d2c9743840d1243ec55770db1a9
Author: Andrew Or <[email protected]>
Date: 2016-02-16T22:55:25Z
Fix CREATE TABLE serde setting
Blatant programming mistake. This was caught by
hive.execution.SQLQuerySuite.
commit 863ebd095e7c36c740ad88ec671522a4550f0273
Author: Andrew Or <[email protected]>
Date: 2016-02-16T23:22:54Z
Fix NPE in CREATE VIEW
When we create views using HiveQl we pass in null data types
because we can't specify these types until later. This caused
an NPE downstream.
commit 539449215ebfc3df5d7b13fbd4808f7e37d20d77
Author: Andrew Or <[email protected]>
Date: 2016-02-17T21:32:36Z
Change CatalogColumn#dataType to String
This fixes a failing test in HiveCompatibilitySuite, where Spark
was ignoring the character limit in varchar but Hive respected it.
The issue was that we were converting Hive types to and from
Spark DataType, and in the process losing the limit information.
Instead of doing this conversion, we simply encode the data type
as a string so we don't lose any information. This means less
type-safety but the real fix is outside the scope of this patch.
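
To illustrate why the round trip loses information, here is a small sketch. The type model (`SimpleType`, `LossyConversion`) is a deliberately simplified assumption and is not Spark's `DataType` hierarchy; the `CatalogColumn` shown is likewise reduced to two fields.

```scala
// Hypothetical type model with no notion of a length limit.
sealed trait SimpleType
case object StringType extends SimpleType
case object IntType extends SimpleType

object LossyConversion {
  private val Varchar = """varchar\((\d+)\)""".r

  // Hive type string -> simplified type; the varchar length is dropped here.
  def fromHive(hiveType: String): SimpleType = hiveType.toLowerCase match {
    case Varchar(_) => StringType
    case "string" => StringType
    case "int" => IntType
    case other => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  // Simplified type -> Hive type string; the limit cannot be recovered.
  def toHive(t: SimpleType): String = t match {
    case StringType => "string"
    case IntType => "int"
  }
}

// Storing the type as a plain string keeps exactly what Hive reported.
case class CatalogColumn(name: String, dataType: String)

object VarcharDemo {
  val roundTripped: String =
    LossyConversion.toHive(LossyConversion.fromHive("varchar(10)"))
  val preserved: String = CatalogColumn("c", "varchar(10)").dataType
}
```

In this sketch `roundTripped` comes back as `"string"`, while `preserved` is still `"varchar(10)"`. The cost, as the commit message notes, is that a string-typed field accepts values that are not valid types at all.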
commit fe295fb6899be00eb8a37eceb6c996cf0794ff2c
Author: Andrew Or <[email protected]>
Date: 2016-02-17T22:23:54Z
Fix style
commit 43e3c66057d37c45db7392c6793baeef05b05039
Author: Andrew Or <[email protected]>
Date: 2016-02-18T00:32:59Z
Fix MetastoreDataSourcesSuite
I missed one place where the data type was still a DataType
rather than a string.
commit 27656491561a918e4e5bec7f44ef946ef825dc19
Author: Andrew Or <[email protected]>
Date: 2016-02-18T03:04:57Z
Add HiveCatalogSuite
This suite extends the existing CatalogTestCases. Many tests
needed to be modified significantly for Hive to work. Even after
many hours spent on trying to make this work, there is still one
that doesn't pass for some reason. In particular, I was not able
to call "alterPartitions" on an existing Hive table as of this
commit. That test is ignored for now. The rest of the
tests added in this commit should pass.
commit 428c3c5cb875d2a160093a5d71f9634c2b0cb6aa
Author: Andrew Or <[email protected]>
Date: 2016-02-18T03:07:17Z
Merge branch 'master' of github.com:apache/spark into hive-catalog
----