GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/2501
[WIP][SPARK-3212][SQL] Use logical plan matching instead of temporary
tables for table caching
_Also addresses: SPARK-1379 and SPARK-3641_
This PR introduces a new trait, `CacheManager`, which replaces the previous
temporary-table-based caching system. Instead of creating a temporary table
that shadows an existing table and provides a cached representation, the
cache manager maintains a separate list of cached data. After optimization,
this list is searched for any matching plan fragments. When a matching plan
fragment is found, it is replaced with the cached data.
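To make the substitution step concrete, here is a minimal sketch of the idea using a toy plan tree. Everything in it is an assumption made for illustration: the `Plan` node types, `ToyCacheManager`, and its `cacheQuery` / `useCachedData` methods are hypothetical stand-ins rather than the Catalyst classes or the code in this PR, and plain case-class equality is used where the PR relies on its own plan comparison.

```scala
// Toy logical plan for illustration only; NOT Spark's Catalyst classes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
// Stands in for a scan over already-cached data.
case class CachedScan(id: Int) extends Plan

// Hypothetical cache manager: keeps a list of (plan, cached data) pairs
// instead of registering a temporary table under some name.
class ToyCacheManager {
  private var cached = Vector.empty[(Plan, CachedScan)]

  // Remember a plan whose result should be served from the cache.
  def cacheQuery(plan: Plan): Unit =
    if (!cached.exists(_._1 == plan))
      cached = cached :+ (plan -> CachedScan(cached.size))

  // Walk the plan top-down. If the current fragment matches a cached plan,
  // replace it with the cached data; otherwise recurse into its child.
  def useCachedData(plan: Plan): Plan =
    cached.collectFirst { case (p, data) if p == plan => data }.getOrElse {
      plan match {
        case Filter(c, child)   => Filter(c, useCachedData(child))
        case Project(cs, child) => Project(cs, useCachedData(child))
        case leaf               => leaf
      }
    }
}

val manager = new ToyCacheManager
val adults = Filter("age >= 18", Scan("people"))
manager.cacheQuery(adults)

// A larger query that contains the cached fragment as a subtree.
val query = Project(Seq("name"), adults)
val rewritten = manager.useCachedData(query)
println(rewritten)
```

In this sketch only the `Filter` subtree is swapped out; the enclosing `Project` is left in place, and a plan with no cached fragment comes back unchanged.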
There are several advantages to this approach:
- Calling .cache() on a SchemaRDD now works as you would expect, and uses
the more efficient columnar representation.
- It's now possible to provide a list of temporary tables, without having
to decide if a given table is actually just a cached persistent table. (To be
done in a follow-up PR)
- In some cases it is possible that cached data will be used, even if a
cached table was not explicitly requested. This is because we now look at the
logical structure instead of the table name.
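The last point can be illustrated with a small, self-contained comparison of name-keyed and structure-keyed lookups. This is only a sketch under stated assumptions: the `Node` types, the `catalog` map, the table names, and the path are all hypothetical, and plain equality stands in for whatever comparison the PR actually performs on logical plans.

```scala
// Toy plan nodes for illustration only; not Spark's Catalyst classes.
sealed trait Node
case class Relation(path: String) extends Node
case class Where(predicate: String, child: Node) extends Node

// Hypothetical catalog: two registered names that resolve to the same plan.
val catalog: Map[String, Node] = Map(
  "events"     -> Relation("/data/events"),
  "events_tmp" -> Relation("/data/events")
)

// Name-keyed cache: only the exact name that was cached will hit.
val cachedNames = Set("events")
def hitsByName(table: String): Boolean = cachedNames.contains(table)

// Structure-keyed cache: any name that resolves to an equal plan will hit.
val cachedPlans: Set[Node] = Set(catalog("events"))
def hitsByStructure(table: String): Boolean =
  cachedPlans.contains(catalog(table))

println(hitsByName("events_tmp"))
println(hitsByStructure("events_tmp"))
```

Here the lookup by name misses for `events_tmp` because that name was never cached, while the lookup by structure hits because both names resolve to the same plan.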
TODO:
- [ ] Finish cleanup of caching specific pattern matching code
- [ ] More test cases for `sameResult` function
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark caching
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2501.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2501
----
commit 80f26acffa8e234434fb8e080c499e6cae9fe6e4
Author: Michael Armbrust <[email protected]>
Date: 2014-09-23T02:41:57Z
First draft of improved semantics for Spark SQL caching.
----