GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/14618
[SPARK-17030] [SQL] Remove/Cleanup HiveMetastoreCatalog.scala
### What changes were proposed in this pull request?
The metadata cache is a key-value cache built on the Google Guava Cache to speed up
building logical plan nodes (`LogicalRelation`) for data source tables. The
cache key is a unique identifier of a table: the fully qualified table name,
including the database in which it resides. (In the future, it could be
extended to multi-part names when introducing a federated catalog.) The value
is the corresponding `LogicalRelation` that represents a specific data source
table.
The cache is session-based. In each session, the cache is managed in two
different ways at the same time:
1. **Auto loading**: when Spark queries the cache for a user-defined data
source table, the cache either returns a cached `LogicalRelation`, or else
automatically builds a new one by decoding the metadata fetched from the
external catalog.
2. **Manual caching**: Hive serde tables are represented as `MetastoreRelation`
logical plan nodes. For better performance, we convert Hive serde tables
to data source tables when possible. The conversion is not done at
metadata-loading time; instead, it happens during semantic analysis.
If a Hive serde table is convertible, we first look up the value (by the
fully qualified table name) in the metadata cache. If present, we use it
directly; otherwise, we build a new one and also push it into the cache for
future reuse.
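The two modes above can be sketched in a few lines of Scala. This is a hypothetical simplification, not the PR's code: a plain mutable map stands in for the Guava cache, and string values stand in for `LogicalRelation`s; `MetadataCacheSketch` and this minimal `TableIdentifier` are illustrative names only.

```Scala
import scala.collection.mutable

// Illustrative stand-in for Spark's TableIdentifier: the cache key is the
// fully qualified table name, including the database.
case class TableIdentifier(table: String, database: Option[String] = None) {
  def qualified: String = s"${database.getOrElse("default")}.$table"
}

class MetadataCacheSketch(load: TableIdentifier => String) {
  private val cache = mutable.Map.empty[String, String]

  // Auto loading: return the cached value, or build a new one via `load`
  // (the role Guava's CacheLoader.load plays in the real cache).
  def getTable(ident: TableIdentifier): String =
    cache.getOrElseUpdate(ident.qualified, load(ident))

  // Manual caching, step 1: look up without triggering a load.
  def getTableIfPresent(ident: TableIdentifier): Option[String] =
    cache.get(ident.qualified)

  // Manual caching, step 2: push an explicitly built value for future reuse.
  def cacheTable(ident: TableIdentifier, plan: String): Unit =
    cache.update(ident.qualified, plan)
}
```

The difference is simply who runs the loader: `getTable` loads on a miss, while `getTableIfPresent`/`cacheTable` leave loading to the caller.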
Currently, the file `HiveMetastoreCatalog.scala` contains a mix of
entities/functions whose only common trait is that they all interact with the
cache, called `cachedDataSourceTables`. This PR cleans up
`HiveMetastoreCatalog.scala`.
**Proposal**: To avoid mixing everything cache-related in the same file,
we abstract and define the following API for cache operations. After the code
changes, `HiveMetastoreCatalog.scala` only contains the cache API
implementation, and the file can be renamed to `MetadataCache.scala`:
```Scala
// cacheTable is a wrapper of cache.put(key, value). It associates value with
// key in this cache. If the cache previously contained a value associated with
// key, the old value is replaced by value.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
```
```Scala
// getTableIfPresent is a wrapper of cache.getIfPresent(key); it never causes
// values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
```
```Scala
// getTable is a wrapper of cache.get(key). On a cache miss, the cache calls
// CacheLoader.load(key) to load a new value; that is, it calls the load
// function.
def getTable(tableIdent: TableIdentifier): LogicalPlan
```
```Scala
// refreshTable is a wrapper of cache.invalidate(key). It does not eagerly
// reload the cache; it just invalidates the entry. The next time the table is
// used, the cache will be repopulated.
def refreshTable(tableIdent: TableIdentifier): Unit
```
```Scala
// invalidateAll is a wrapper of cache.invalidateAll. It discards all entries
// in the cache.
def invalidateAll(): Unit
```
This PR also moves three Hive-specific Analyzer rules `CreateTables`,
`OrcConversions` and `ParquetConversions` from `HiveMetastoreCatalog.scala` to
`HiveStrategies.scala`.
**Note:** The diff is large and not easy to review as a whole. Please review
it commit by commit.
**Future work:** Move `MetadataCache` to `sql/core` once we decide to use it
as a cache for all external catalogs.
### How was this patch tested?
Existing test cases
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark metadataCache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14618
----
commit 938b4303cde28e72481324bdb8f5f89b2ae1baff
Author: gatorsmile <[email protected]>
Date: 2016-08-12T03:49:15Z
remove lookupRelation
commit 43eb5ee93401c3b63ca2aaba1569a3e077450821
Author: gatorsmile <[email protected]>
Date: 2016-08-12T04:11:56Z
remove hiveDefaultTableFilePath
commit c68015e47813abd270d98da79e8981ac3b7660f5
Author: gatorsmile <[email protected]>
Date: 2016-08-12T04:25:04Z
remove CreateTables
commit 839712ff963ad5d218ac96fdd1ee1d387fc8e45f
Author: gatorsmile <[email protected]>
Date: 2016-08-12T05:54:40Z
remove OrcConversions and ParquetConversions
commit 164be254cab43f40c697729eb87109b7345f73e2
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:11:51Z
remove getCachedDataSourceTable
commit 32f7caab469343cf0d2eb9189f04b91e0b38f8c9
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:19:17Z
remove Hive dependency
commit be770f45ab7bfe8c435df08a4080f378ca4ff9de
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:31:54Z
rename HiveMetaStoreCatalog to MetadataCache
commit 6b96ee0a266d64af2512aac5bb151d5b594827e8
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:39:03Z
make cachedDataSourceTables private
commit 847a526037ae9082f18a4a93b56e6d7dcef25ff2
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:51:37Z
remove useless empty lines
commit 9fe620567aa7d61038ef497bf2358e6fff374d38
Author: gatorsmile <[email protected]>
Date: 2016-08-12T06:56:07Z
remove convertMetastoreParquet, convertMetastoreParquetWithSchemaMerging,
convertMetastoreOrc from HiveSessionState
----