[ https://issues.apache.org/jira/browse/SPARK-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-17030:
----------------------------
Description:
The metadata cache is a key-value cache built on Google Guava Cache to speed up
building logical plan nodes (`LogicalRelation`) for data source tables. The
cache key is a unique identifier of a table. Here, the identifier is the fully
qualified table name, including the database in which it resides. (In the
future, it could be extended to multi-part names when a federated Catalog is
introduced.) The value is the corresponding `LogicalRelation` that represents a
specific data source table.
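As a minimal sketch of such a key, assuming names are compared
case-insensitively and that an unqualified name resolves against the session's
current database (both are assumptions, as are the stand-in types
`TableIdentifier` and `QualifiedTableName` and the helper `qualify`):

```scala
import java.util.Locale

// Hypothetical stand-ins; they only mirror the shape described above.
final case class TableIdentifier(table: String, database: Option[String] = None)
final case class QualifiedTableName(database: String, name: String)

object CacheKeys {
  // Builds the cache key: database plus table name, lower-cased (assumption)
  // so that differently-cased references map to the same cache entry.
  def qualify(ident: TableIdentifier, currentDatabase: String): QualifiedTableName =
    QualifiedTableName(
      ident.database.getOrElse(currentDatabase).toLowerCase(Locale.ROOT),
      ident.table.toLowerCase(Locale.ROOT))
}
```

With this helper, `TableIdentifier("Sales")` resolved against `default` and
`TableIdentifier("sales", Some("default"))` produce the same key.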
The cache is session-based. In each session, the cache is managed in two
different ways at the same time.
1. **Auto loading**: when Spark queries the cache for a user-defined data
source table, the cache either returns a cached `LogicalRelation` or
automatically builds a new one by decoding the metadata fetched from the
external Catalog.
2. **Manual caching**: Hive tables are represented as `MetastoreRelation`
logical plan nodes. For better performance, we convert Hive serde tables to
data source tables when they are convertible. The conversion is not done when
the metadata is loaded; instead, it is performed during semantic analysis. If a
Hive serde table is convertible, we first try to get the value (by the fully
qualified table name) from the metadata cache. If it exists, we use it directly;
otherwise, we build a new one and also put it into the cache for future reuse.
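The two paths above can be sketched directly against Guava's `LoadingCache`.
This is only an illustration, not the code in `HiveMetastoreCatalog.scala`: it
assumes Guava is on the classpath, and `QualifiedTableName`, `Plan`,
`lookupDataSourceTable`, `lookupConvertedHiveTable` and the size bound are all
hypothetical.

```scala
import java.util.concurrent.atomic.AtomicInteger
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Hypothetical stand-ins for the cache key and the cached logical plan.
final case class QualifiedTableName(database: String, name: String)
final case class Plan(description: String)

object CacheModesSketch {
  // Counts loader invocations so the two paths can be told apart.
  val loads = new AtomicInteger(0)

  private val loader = new CacheLoader[QualifiedTableName, Plan]() {
    // Stands in for decoding metadata fetched from the external catalog.
    override def load(key: QualifiedTableName): Plan = {
      loads.incrementAndGet()
      Plan(s"loaded: ${key.database}.${key.name}")
    }
  }

  val cache: LoadingCache[QualifiedTableName, Plan] =
    CacheBuilder.newBuilder()
      .maximumSize(1000) // illustrative bound
      .build[QualifiedTableName, Plan](loader)

  // Auto loading: get() returns the cached value or invokes the loader.
  def lookupDataSourceTable(key: QualifiedTableName): Plan = cache.get(key)

  // Manual caching: getIfPresent() never invokes the loader; on a miss the
  // caller builds the converted plan and puts it into the cache itself.
  def lookupConvertedHiveTable(key: QualifiedTableName)(convert: => Plan): Plan =
    Option(cache.getIfPresent(key)).getOrElse {
      val converted = convert
      cache.put(key, converted)
      converted
    }
}
```

In this sketch a second `lookupDataSourceTable` call for the same key does not
run the loader again, and `lookupConvertedHiveTable` never runs it at all.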
Currently, the file `HiveMetastoreCatalog.scala` contains a mix of different
entities and functions, since all of them need to interact with the cache,
called `cachedDataSourceTables`. This JIRA is to clean up
`HiveMetastoreCatalog.scala`.
**Proposal**: To avoid mixing everything related to the cache in the same file,
we abstract and define the following API for cache operations. After the code
changes, `HiveMetastoreCatalog.scala` contains only the cache API
implementation. The file can then be renamed to `MetadataCache.scala`.
{noformat}
// cacheTable is a wrapper of cache.put(key, value). It associates value with
// key in this cache.
// If the cache previously contained a value associated with key, the old value
// is replaced by value.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
{noformat}
{noformat}
// getTableIfPresent is a wrapper of cache.getIfPresent(key) that never causes
// values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
{noformat}
{noformat}
// getTable is a wrapper of cache.get(key). On a cache miss, a cache built with
// a CacheLoader calls CacheLoader.load(K) to load the new value into the cache.
// That is, it calls the function load.
def getTable(tableIdent: TableIdentifier): LogicalPlan
{noformat}
{noformat}
// refreshTable is a wrapper of cache.invalidate. It does not eagerly reload
// the cache.
// It just invalidates the cache entry. The next time the table is used, the
// entry is populated in the cache again.
def refreshTable(tableIdent: TableIdentifier): Unit
{noformat}
{noformat}
// Discards all entries in the cache. It is a wrapper of cache.invalidateAll.
def invalidateAll(): Unit
{noformat}
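A minimal sketch of how these five operations could wrap a Guava
`LoadingCache`, assuming Guava is on the classpath. `TableIdentifier`,
`LogicalPlan` and `StubRelation` below are simplified stand-ins for the Spark
types, and the `loadFromCatalog` hook and the size bound are hypothetical; the
identifier is assumed to be fully qualified already.

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Simplified stand-ins for Spark's TableIdentifier and LogicalPlan.
final case class TableIdentifier(table: String, database: Option[String] = None)
trait LogicalPlan
final case class StubRelation(qualifiedName: String) extends LogicalPlan

// loadFromCatalog is a hypothetical hook that builds a plan from the metadata
// in the external catalog. It must not return null, because Guava rejects
// null values produced by a loader.
class MetadataCache(loadFromCatalog: TableIdentifier => LogicalPlan) {

  private val loader = new CacheLoader[TableIdentifier, LogicalPlan]() {
    override def load(key: TableIdentifier): LogicalPlan = loadFromCatalog(key)
  }

  private val cache: LoadingCache[TableIdentifier, LogicalPlan] =
    CacheBuilder.newBuilder()
      .maximumSize(1000) // illustrative bound
      .build[TableIdentifier, LogicalPlan](loader)

  // Associates plan with tableIdent, replacing any previous value.
  def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit =
    cache.put(tableIdent, plan)

  // Never triggers loading; getIfPresent returns null on a miss.
  def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan] =
    Option(cache.getIfPresent(tableIdent))

  // Returns the cached plan, or loads it through the CacheLoader on a miss.
  def getTable(tableIdent: TableIdentifier): LogicalPlan =
    cache.get(tableIdent)

  // Drops the entry; it is reloaded lazily on the next getTable call.
  def refreshTable(tableIdent: TableIdentifier): Unit =
    cache.invalidate(tableIdent)

  // Discards all entries.
  def invalidateAll(): Unit =
    cache.invalidateAll()
}
```

Returning `Option` from `getTableIfPresent` keeps Guava's null-on-miss
convention out of the callers, which matches the signature proposed above.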
We should also move three Hive-specific Analyzer rules `CreateTables`,
`OrcConversions` and `ParquetConversions` from `HiveMetastoreCatalog.scala` to
`HiveStrategies.scala`.
was:
The metadata cache is a key-value cache built on Google Guava Cache to speed up
building logical plan nodes (`LogicalRelation`) for data source tables. The
cache key is a unique identifier of a table. Here, the identifier is the fully
qualified table name, including the database in which it resides. (In the
future, it could be extended to multi-part names when a federated Catalog is
introduced.) The value is the corresponding `LogicalRelation` that represents a
specific data source table.
The cache is session-based. In each session, the cache is managed in two
different ways at the same time.
1. **Auto loading**: when Spark queries the cache for a user-defined data
source table, the cache either returns a cached `LogicalRelation` or
automatically builds a new one by decoding the metadata fetched from the
external Catalog.
2. **Manual caching**: Hive tables are represented as `MetastoreRelation`
logical plan nodes. For better performance, we convert Hive serde tables to
data source tables when they are convertible. The conversion is not done when
the metadata is loaded; instead, it is performed during semantic analysis. If a
Hive serde table is convertible, we first try to get the value (by the fully
qualified table name) from the metadata cache. If it exists, we use it directly;
otherwise, we build a new one and also put it into the cache for future reuse.
Currently, the file `HiveMetastoreCatalog.scala` contains a mix of different
entities and functions, since all of them need to interact with the cache,
called `cachedDataSourceTables`. This PR is to clean up
`HiveMetastoreCatalog.scala`.
**Proposal**: To avoid mixing everything related to the cache in the same file,
we abstract and define the following API for cache operations. After the code
changes, `HiveMetastoreCatalog.scala` contains only the cache API
implementation. The file can then be renamed to `MetadataCache.scala`.
{noformat}
// cacheTable is a wrapper of cache.put(key, value). It associates value with
// key in this cache.
// If the cache previously contained a value associated with key, the old value
// is replaced by value.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
{noformat}
{noformat}
// getTableIfPresent is a wrapper of cache.getIfPresent(key) that never causes
// values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
{noformat}
{noformat}
// getTable is a wrapper of cache.get(key). On a cache miss, a cache built with
// a CacheLoader calls CacheLoader.load(K) to load the new value into the cache.
// That is, it calls the function load.
def getTable(tableIdent: TableIdentifier): LogicalPlan
{noformat}
{noformat}
// refreshTable is a wrapper of cache.invalidate. It does not eagerly reload
// the cache.
// It just invalidates the cache entry. The next time the table is used, the
// entry is populated in the cache again.
def refreshTable(tableIdent: TableIdentifier): Unit
{noformat}
{noformat}
// Discards all entries in the cache. It is a wrapper of cache.invalidateAll.
def invalidateAll(): Unit
{noformat}
This PR also moves three Hive-specific Analyzer rules `CreateTables`,
`OrcConversions` and `ParquetConversions` from `HiveMetastoreCatalog.scala` to
`HiveStrategies.scala`.
> Remove/Cleanup HiveMetastoreCatalog.scala
> -----------------------------------------
>
> Key: SPARK-17030
> URL: https://issues.apache.org/jira/browse/SPARK-17030
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li