[ 
https://issues.apache.org/jira/browse/SPARK-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17030:
----------------------------
    Description: 
The metadata cache is a key-value cache built on Google Guava Cache to speed up 
building logical plan nodes (`LogicalRelation`) for data source tables. The 
cache key is a unique identifier of a table: the fully qualified table name, 
including the database in which the table resides. (In the future, it could be 
extended to multi-part names when a federated catalog is introduced.) The value 
is the corresponding `LogicalRelation` that represents a specific data source 
table.
The cache is session-based. In each session, the cache is managed in two 
different ways at the same time.

1. **Auto loading**: when Spark queries the cache for a user-defined data 
source table, the cache either returns a cached `LogicalRelation` or 
automatically builds a new one by decoding the metadata fetched from the 
external catalog. 
2. **Manual caching**: Hive tables are represented as `MetastoreRelation` 
logical plan nodes. For better performance, we convert Hive serde tables to 
data source tables when possible. The conversion is not performed at the stage 
of metadata loading; instead, it is conducted during semantic analysis. If a 
Hive serde table is convertible, we first try to get the value (by the fully 
qualified table name) from the metadata cache. If it exists, we use it 
directly; otherwise, we build a new one and also push it into the cache for 
future reuse.
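The manual-caching path described above can be sketched as follows. Note that `TableIdentifier`, `LogicalRelation`, and the conversion logic here are simplified stand-ins for illustration, not Spark's actual implementations.

```scala
// Hypothetical minimal types standing in for Spark's internals.
case class TableIdentifier(table: String)
case class LogicalRelation(name: String)

// Session-scoped cache of converted relations, keyed by table identifier.
val cache = scala.collection.mutable.Map.empty[TableIdentifier, LogicalRelation]

// Consult the cache first; on a miss, build the relation and push it
// into the cache for future reuse.
def convertToLogicalRelation(tableIdent: TableIdentifier): LogicalRelation =
  cache.get(tableIdent) match {
    case Some(relation) => relation // cache hit: reuse directly
    case None =>
      val relation = LogicalRelation(tableIdent.table) // stand-in for the real conversion
      cache.put(tableIdent, relation)
      relation
  }
```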

Currently, the file `HiveMetastoreCatalog.scala` contains a variety of 
entities/functions only because all of them interact with the cache, called 
`cachedDataSourceTables`. This JIRA is to clean up `HiveMetastoreCatalog.scala`. 

**Proposal**: To avoid mixing everything related to the cache in the same file, 
we abstract and define the following API for cache operations. After the code 
changes, `HiveMetastoreCatalog.scala` will contain only the cache API 
implementation, and the file can be renamed to `MetadataCache.scala`.

{noformat}
// cacheTable is a wrapper of cache.put(key, value). It associates the value
// with the key in this cache. If the cache previously contained a value
// associated with the key, the old value is replaced.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
{noformat}

{noformat}
// getTableIfPresent is a wrapper of cache.getIfPresent(key); it never causes
// values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
{noformat}

{noformat}
// getTable is a wrapper of cache.get(key). On a cache miss, the cache calls
// CacheLoader.load(key) to build a new value and load it into the cache.
def getTable(tableIdent: TableIdentifier): LogicalPlan
{noformat}

{noformat}
// refreshTable is a wrapper of cache.invalidate. It does not eagerly reload
// the cache; it only invalidates the entry. The next time the table is used,
// it will be reloaded into the cache.
def refreshTable(tableIdent: TableIdentifier): Unit
{noformat}

{noformat}
// Discards all entries in the cache. It is a wrapper of cache.invalidateAll.
def invalidateAll(): Unit
{noformat}
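A minimal sketch of how the proposed API could be backed by a session-scoped cache. To keep it self-contained, Guava's LoadingCache is replaced by a concurrent map plus an explicit loader function, and `TableIdentifier`/`LogicalPlan` are hypothetical simplified stand-ins for Spark's actual classes.

```scala
import scala.collection.concurrent.TrieMap

// Simplified stand-ins for Spark's TableIdentifier and LogicalPlan.
case class TableIdentifier(table: String, database: Option[String] = None)
case class LogicalPlan(name: String)

// A minimal session-scoped metadata cache; `load` plays the role of
// Guava's CacheLoader.load(key).
class MetadataCache(load: TableIdentifier => LogicalPlan) {
  private val cache = TrieMap.empty[TableIdentifier, LogicalPlan]

  // Wrapper of cache.put: associates the plan with the identifier,
  // replacing any previously cached plan.
  def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit =
    cache.put(tableIdent, plan)

  // Wrapper of cache.getIfPresent: never triggers automatic loading.
  def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan] =
    cache.get(tableIdent)

  // Wrapper of cache.get: on a miss, calls the loader and caches the result.
  def getTable(tableIdent: TableIdentifier): LogicalPlan =
    cache.getOrElseUpdate(tableIdent, load(tableIdent))

  // Wrapper of cache.invalidate: drops the entry without eager reloading.
  def refreshTable(tableIdent: TableIdentifier): Unit =
    cache.remove(tableIdent)

  // Wrapper of cache.invalidateAll: discards every entry.
  def invalidateAll(): Unit =
    cache.clear()
}
```

This mirrors the two management modes above: `getTable` is the auto-loading path, while `getTableIfPresent` plus `cacheTable` together form the manual-caching path.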

We should also move the three Hive-specific Analyzer rules `CreateTables`, 
`OrcConversions`, and `ParquetConversions` from `HiveMetastoreCatalog.scala` to 
`HiveStrategies.scala`. 



> Remove/Cleanup HiveMetastoreCatalog.scala
> -----------------------------------------
>
>                 Key: SPARK-17030
>                 URL: https://issues.apache.org/jira/browse/SPARK-17030
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
