GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/2912

    [SPARK-4050][SQL] Fix caching of temporary tables with projections.

    Previously, cached data was located by `sameResult` matching on optimized 
plans.  This technique, however, fails to locate the cached data when a 
temporary table with a projection is queried with a further reduced projection.  
The failure occurs because optimization collapses the projections, producing a 
plan that no longer yields the same result as the cached data (though the 
cached data still subsumes the desired data).  For example, consider the 
following previously failing test case.
    
    ```scala
    sql("CACHE TABLE tempTable AS SELECT key FROM testData")
    assertCached(sql("SELECT COUNT(*) FROM tempTable"))
    ```
    
    In this PR I change the matching to occur after analysis instead of 
optimization, so that in the case of temporary tables the plans will always 
match.  I think this should work generally.  However, this error does raise 
questions about the need to do more thorough subsumption checking when locating 
cached data.
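    
    To make the difference concrete, here is a minimal, self-contained sketch 
of the two matching strategies.  It does not use Catalyst: `Plan`, `optimize`, 
and `usesCache` are hypothetical stand-ins, and the toy optimizer only assumes 
that a projection feeding a COUNT(*) gets removed.
    
    ```scala
    object PlanMatchingSketch {
      // Toy logical plans. Illustrative stand-ins, not Catalyst classes.
      sealed trait Plan
      case class Relation(name: String) extends Plan
      case class Project(columns: Seq[String], child: Plan) extends Plan
      case class Count(child: Plan) extends Plan

      // Assumed optimizer behavior: merge stacked projections, and drop a
      // projection under COUNT(*) because counting rows needs no columns.
      def optimize(plan: Plan): Plan = plan match {
        case Project(outer, Project(_, child)) => optimize(Project(outer, child))
        case Project(columns, child)           => Project(columns, optimize(child))
        case Count(Project(_, child))          => optimize(Count(child))
        case Count(child)                      => Count(optimize(child))
        case relation: Relation                => relation
      }

      // Exact-match lookup: true if some subtree of `query` equals `cached`.
      def usesCache(query: Plan, cached: Plan): Boolean =
        query == cached || (query match {
          case Project(_, child) => usesCache(child, cached)
          case Count(child)      => usesCache(child, cached)
          case _: Relation       => false
        })

      // CACHE TABLE tempTable AS SELECT key FROM testData
      val cached: Plan = Project(Seq("key"), Relation("testData"))
      // SELECT COUNT(*) FROM tempTable
      val query: Plan = Count(cached)

      // Matching optimized plans: the projection is gone, so nothing matches.
      val matchedOnOptimized: Boolean = usesCache(optimize(query), optimize(cached))
      // Matching analyzed plans: the temp table's plan appears verbatim.
      val matchedOnAnalyzed: Boolean = usesCache(query, cached)
    }
    ```
    
    In this toy model `matchedOnOptimized` is false and `matchedOnAnalyzed` is 
true.  Exact matching still cannot recognize a query that needs only a subset 
of what was cached, which is the subsumption question raised above.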
    
    Another question is what sort of semantics we want to provide when 
uncaching data from temporary tables.  For example, consider the following 
sequence of commands:
    
    ```scala
    testData.select('key).registerTempTable("tempTable1")
    testData.select('key).registerTempTable("tempTable2")
    cacheTable("tempTable1")
    
    // This obviously works.
    assertCached(sql("SELECT COUNT(*) FROM tempTable1"))
    
    // It seems good that this works ...
    assertCached(sql("SELECT COUNT(*) FROM tempTable2"))
    
    // ... but is this valid?
    uncacheTable("tempTable2")
    
    // Should this still be cached?
    assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
    ```
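    
    One way to see why the last assertion expects zero cached relations is to 
model a cache that is keyed by plan rather than by table name.  The sketch 
below is an assumption-laden illustration, not the actual cache manager: 
`PlanKeyedCache` and `UncacheSketch` are hypothetical names, and plans are 
represented as plain strings.
    
    ```scala
    import scala.collection.mutable

    // Hypothetical cache keyed by plan rather than by table name.
    // Plans are plain strings here purely for illustration.
    class PlanKeyedCache {
      private val tempTables = mutable.Map.empty[String, String]
      private val cachedPlans = mutable.Set.empty[String]

      def registerTempTable(name: String, plan: String): Unit =
        tempTables.update(name, plan)

      def cacheTable(name: String): Unit = { cachedPlans += tempTables(name); () }

      def uncacheTable(name: String): Unit = { cachedPlans -= tempTables(name); () }

      def isCached(name: String): Boolean = cachedPlans.contains(tempTables(name))
    }

    object UncacheSketch {
      // Replays the command sequence above against the toy cache.
      def run(): (Boolean, Boolean) = {
        val cache = new PlanKeyedCache
        cache.registerTempTable("tempTable1", "Project(key, testData)")
        cache.registerTempTable("tempTable2", "Project(key, testData)")
        cache.cacheTable("tempTable1")
        // Same plan, so tempTable2 is served from the cache as well.
        val table2CachedBefore = cache.isCached("tempTable2")
        cache.uncacheTable("tempTable2")
        // The shared entry is gone, so tempTable1 is no longer cached either.
        val table1CachedAfter = cache.isCached("tempTable1")
        (table2CachedBefore, table1CachedAfter)
      }
    }
    ```
    
    Because both names resolve to the same plan, they share one cache entry, 
so uncaching through either name evicts it for both.  Keeping the entry alive 
until every name is uncached would need extra bookkeeping, such as a per-plan 
reference count.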

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark cachingBug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2912.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2912
    
----
commit 03f1cfef01e1a1b6dec9aa1b16bbe4c74c2bed88
Author: Michael Armbrust <[email protected]>
Date:   2014-10-23T20:02:42Z

    Clean-up / add tests to SameResult suite.

commit 63a23e4903c6c17051e321b4ec36a7d199e31cb4
Author: Michael Armbrust <[email protected]>
Date:   2014-10-23T20:03:25Z

    Perform caching on analyzed instead of optimized plan.

commit 5c72fb71eed3df8e58060b3847e13018c621b910
Author: Michael Armbrust <[email protected]>
Date:   2014-10-23T20:08:02Z

    Add a test case / question about uncaching semantics.

----

