GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/14618
[SPARK-17030] [SQL] Remove/Cleanup HiveMetastoreCatalog.scala
### What changes were proposed in this pull request?
The metadata cache is a key-value cache built on the Google Guava Cache to speed up
building logical plan nodes (`LogicalRelation`) for data source tables. The
cache key is a unique identifier of a table: the fully qualified table name,
including the database in which it resides. (In the future, it could be
extended to multi-part names when introducing a federated catalog.) The value
is the corresponding `LogicalRelation` that represents a specific data source
table.
The cache is session-based. In each session, the cache is managed in two
different ways at the same time:
1. **Auto loading**: when Spark queries the cache for a user-defined data
source table, the cache either returns a cached `LogicalRelation`, or else
automatically builds a new one by decoding the metadata fetched from the
external catalog.
2. **Manual caching**: Hive serde tables are represented as `MetastoreRelation`
logical plan nodes. For better performance, we convert Hive serde tables
to data source tables when possible. The conversion is not done at
metadata-loading time; instead, it happens during semantic analysis.
If a Hive serde table is convertible, we first look up the value (by the
fully qualified table name) in the metadata cache. If present, we use it
directly; otherwise, we build a new one and also push it into the cache for
future reuse.
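The two modes above can be sketched in a few lines of Scala. This is a hypothetical simplification, not the PR's code: a plain mutable map stands in for the Guava cache, and string values stand in for `LogicalRelation`s; `MetadataCacheSketch` and this minimal `TableIdentifier` are illustrative names only.

```Scala
import scala.collection.mutable

// Illustrative stand-in for Spark's TableIdentifier: the cache key is the
// fully qualified table name, including the database.
case class TableIdentifier(table: String, database: Option[String] = None) {
  def qualified: String = s"${database.getOrElse("default")}.$table"
}

class MetadataCacheSketch(load: TableIdentifier => String) {
  private val cache = mutable.Map.empty[String, String]

  // Auto loading: return the cached value, or build a new one via `load`
  // (the role Guava's CacheLoader.load plays in the real cache).
  def getTable(ident: TableIdentifier): String =
    cache.getOrElseUpdate(ident.qualified, load(ident))

  // Manual caching, step 1: look up without triggering a load.
  def getTableIfPresent(ident: TableIdentifier): Option[String] =
    cache.get(ident.qualified)

  // Manual caching, step 2: push an explicitly built value for future reuse.
  def cacheTable(ident: TableIdentifier, plan: String): Unit =
    cache.update(ident.qualified, plan)
}
```

The difference is simply who runs the loader: `getTable` loads on a miss, while `getTableIfPresent`/`cacheTable` leave loading to the caller.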
Currently, the file `HiveMetastoreCatalog.scala` contains a mix of
entities/functions whose only common trait is that they all interact with the
cache, called `cachedDataSourceTables`. This PR cleans up
`HiveMetastoreCatalog.scala`.
**Proposal**: To avoid mixing everything cache-related in the same file,
we abstract and define the following API for cache operations. After the code
changes, `HiveMetastoreCatalog.scala` only contains the cache API
implementation, and the file can be renamed to `MetadataCache.scala`:
```Scala
// cacheTable is a wrapper of cache.put(key, value). It associates value with
// key in this cache. If the cache previously contained a value associated with
// key, the old value is replaced by value.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
```
```Scala
// getTableIfPresent is a wrapper of cache.getIfPresent(key); it never causes
// values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
```
```Scala
// getTable is a wrapper of cache.get(key). On a cache miss, the cache calls
// CacheLoader.load(key) to load a new value; that is, it calls the load
// function.
def getTable(tableIdent: TableIdentifier): LogicalPlan
```
```Scala
// refreshTable is a wrapper of cache.invalidate(key). It does not eagerly
// reload the cache; it just invalidates the entry. The next time the table is
// used, the cache will be repopulated.
def refreshTable(tableIdent: TableIdentifier): Unit
```
```Scala
// invalidateAll is a wrapper of cache.invalidateAll. It discards all entries
// in the cache.
def invalidateAll(): Unit
```
This PR also moves three Hive-specific Analyzer rules `CreateTables`,
`OrcConversions` and `ParquetConversions` from `HiveMetastoreCatalog.scala` to
`HiveStrategies.scala`.
**Note:** The diff is large and not easy to review as a whole. Please review
it commit by commit.
**Future work:** Move `MetadataCache` to `sql/core` once we decide to use it
as a cache for all external catalogs.
### How was this patch tested?
Existing test cases
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark metadataCache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14618
----
commit 938b4303cde28e72481324bdb8f5f89b2ae1baff
Author: gatorsmile <[email protected]>
Date: 2016-08-12T03:49:15Z
remove lookupRelation
commit 43eb5ee93401c3b63ca2aaba1569a3e077450821
Author: gatorsmile <[email protected]>
Date: 2016-08-12T04:11:56Z
remove hiveDefaultTableFilePath
commit c68015e47813abd270d98da79e8981ac3b7660f5
Author: gatorsmile <[email protected]>
Date: 2016-08-12T04:25:04Z
remove CreateTables
commit 839712ff963ad5d218ac96fdd1ee1d387fc8e45f
Author: gatorsmile <[email protected]>
Date: 2016-08-12T05:54:40Z
remove OrcConversions and ParquetConversions
commit 164be254cab43f40c697729eb87109b7345f73e2
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:11:51Z
remove getCachedDataSourceTable
commit 32f7caab469343cf0d2eb9189f04b91e0b38f8c9
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:19:17Z
remove Hive dependency
commit be770f45ab7bfe8c435df08a4080f378ca4ff9de
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:31:54Z
rename HiveMetaStoreCatalog to MetadataCache
commit 6b96ee0a266d64af2512aac5bb151d5b594827e8
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:39:03Z
make cachedDataSourceTables private
commit 847a526037ae9082f18a4a93b56e6d7dcef25ff2
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:51:37Z
remove useless empty lines
commit 9fe620567aa7d61038ef497bf2358e6fff374d38
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:56:07Z
remove convertMetastoreParquet, convertMetastoreParquetWithSchemaMerging,
convertMetastoreOrc from HiveSessionState
----