[ 
https://issues.apache.org/jira/browse/SPARK-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-17030:
---------------------------------
    Labels: bulk-closed  (was: )

> Remove/Cleanup HiveMetastoreCatalog.scala
> -----------------------------------------
>
>                 Key: SPARK-17030
>                 URL: https://issues.apache.org/jira/browse/SPARK-17030
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>            Priority: Major
>              Labels: bulk-closed
>
> The metadata cache is a key-value cache built on Google Guava Cache to speed 
> up building logical plan nodes (`LogicalRelation`) for data source tables. 
> The cache key is a unique identifier of a table. Here, the identifier is the 
> fully qualified table name, including the database in which it resides. (In 
> the future, it could be extended to a multi-part name when introducing a 
> federated Catalog.) The value is the corresponding `LogicalRelation` that 
> represents a specific data source table.  
> The cache is session-based. In each session, the cache is managed in two 
> different ways at the same time. 
> 1. **Auto loading**: when Spark queries the cache for a user-defined data 
> source table, the cache either returns a cached `LogicalRelation` or 
> automatically builds a new one by decoding the metadata fetched from the 
> external Catalog. 
> 2. **Manual caching**: Hive tables are represented as `MetastoreRelation` 
> logical plan nodes. For better performance, we convert Hive serde tables to 
> data source tables, if convertible. The conversion is not completed at the 
> stage of metadata loading. Instead, it is conducted during semantic analysis. 
> If a Hive serde table is convertible, we first try to get the value (by the 
> fully qualified table name) from the metadata cache. If it exists, we use it 
> directly; otherwise, we build a new one and also push it into the cache for 
> future reuse.
> Currently, the file `HiveMetastoreCatalog.scala` contains different 
> entities/functions, since all of them require interaction with the cache, 
> called `cachedDataSourceTables`. This JIRA is to clean up 
> `HiveMetastoreCatalog.scala`. 
> **Proposal**: To avoid mixing everything related to the cache in the same 
> file, we abstract and define the following API for cache operations. After 
> the code changes, `HiveMetastoreCatalog.scala` only contains the cache API 
> implementation. The file can then be renamed to `MetadataCache.scala`.
> {noformat}
> // cacheTable is a wrapper of cache.put(key, value). It associates value
> // with key in this cache. If the cache previously contained a value
> // associated with key, the old value is replaced by value.
> def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
> {noformat}
> {noformat}
> // getTableIfPresent is a wrapper of cache.getIfPresent(key) that never
> // causes values to be automatically loaded.
> def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
> {noformat}
> {noformat}
> // getTable is a wrapper of cache.get(key). On a cache miss, a cache backed
> // by a CacheLoader calls CacheLoader.load(K) to load the new value into the
> // cache. That means it will call the function load.
> def getTable(tableIdent: TableIdentifier): LogicalPlan
> {noformat}
> {noformat}
> // refreshTable is a wrapper of cache.invalidate. It does not eagerly reload
> // the cache; it just invalidates the entry. The next time the table is
> // used, it will be populated in the cache again.
> def refreshTable(tableIdent: TableIdentifier): Unit
> {noformat}
> {noformat}
> // Discards all entries in the cache. It is a wrapper of cache.invalidateAll.
> def invalidateAll(): Unit
> {noformat}
> We should also move three Hive-specific Analyzer rules `CreateTables`, 
> `OrcConversions` and `ParquetConversions` from `HiveMetastoreCatalog.scala` 
> to `HiveStrategies.scala`. 
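
The five cache operations quoted above can be sketched together as follows. This is only an illustration of the described semantics, not the code in Spark: the class name `MetadataCache` follows the proposed file name, `TableIdentifier` and `LogicalPlan` are hypothetical stand-in case classes rather than Spark's own types, the `load` parameter is an assumed substitute for `CacheLoader.load`, and a plain mutable map replaces the Guava cache so the example stays self-contained.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for Spark's TableIdentifier and LogicalPlan.
final case class TableIdentifier(table: String, database: Option[String] = None)
final case class LogicalPlan(description: String)

// Sketch of the proposed cache API. `load` is an assumed loader function that
// plays the role of CacheLoader.load; a mutable map stands in for Guava.
class MetadataCache(load: TableIdentifier => LogicalPlan) {
  private val cache = mutable.Map.empty[TableIdentifier, LogicalPlan]

  // Manual caching: associate plan with tableIdent, replacing any old value.
  def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit =
    synchronized { cache.put(tableIdent, plan); () }

  // Look up without ever triggering a load.
  def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan] =
    synchronized { cache.get(tableIdent) }

  // Auto loading: return the cached value, or load and cache it on a miss.
  def getTable(tableIdent: TableIdentifier): LogicalPlan =
    synchronized { cache.getOrElseUpdate(tableIdent, load(tableIdent)) }

  // Invalidate a single entry; it is reloaded lazily on the next getTable.
  def refreshTable(tableIdent: TableIdentifier): Unit =
    synchronized { cache.remove(tableIdent); () }

  // Discard all entries.
  def invalidateAll(): Unit =
    synchronized { cache.clear() }
}
```

In this sketch, `getTable` corresponds to the auto-loading path, while `getTableIfPresent` followed by `cacheTable` corresponds to the manual path used when converting a Hive serde table. The coarse `synchronized` locking is a simplification chosen for brevity.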



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
