GitHub user maropu opened a pull request:

    [SPARK-23880][SQL] Do not trigger any jobs for caching data

    ## What changes were proposed in this pull request?
    This pr fixed the code so that `cache` could prevent any jobs from being triggered. For example, in the current master, the operation below triggers an actual job:

        val df = spark.range(10000000000L)
          .filter('id > 1000)
          .orderBy('id.desc)
          .cache()
    This triggers a job even though the cache should be lazy. The problem is that, when creating `InMemoryRelation`, we build the RDD, which calls `SparkPlan.execute` and may trigger jobs, such as a sampling job for a range partitioner or a broadcast job.
    With this fix, the `RDD` is no longer built in the constructor of `InMemoryRelation`. Instead, `InMemoryTableScanExec` materializes the cache and updates the entry in the cache manager.
    ## How was this patch tested?
    Added tests in `CachedTableSuite`.
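The deferred-build idea described above can be sketched in plain Scala, with no Spark dependency. The names (`CachedRelationSketch`, `cachedBuffers`) are hypothetical, for illustration only: construction records *how* to build the cached data but runs nothing, and the first scan forces materialization, mirroring the lazy behavior the patch restores.

```scala
// Hypothetical sketch of lazy cache materialization (not Spark's actual API).
// The constructor stores a thunk; `lazy val` defers running it until first access.
class CachedRelationSketch(build: () => Seq[Long]) {
  // No job runs at construction time; the buffers are built on first scan.
  lazy val cachedBuffers: Seq[Long] = build()
}

object LazyCacheDemo {
  def main(args: Array[String]): Unit = {
    var jobsTriggered = 0
    // Creating the relation must not trigger the (simulated) job.
    val relation = new CachedRelationSketch(() => {
      jobsTriggered += 1
      (1L to 5L).filter(_ > 2L)
    })
    assert(jobsTriggered == 0, "construction should trigger no jobs")

    // The first scan materializes the cache, triggering exactly one job.
    val scanned = relation.cachedBuffers
    assert(jobsTriggered == 1, "first scan should trigger the job")
    assert(scanned == Seq(3L, 4L, 5L))

    // Later scans reuse the materialized buffers; no further jobs run.
    relation.cachedBuffers
    assert(jobsTriggered == 1, "subsequent scans should reuse the cache")
    println("ok")
  }
}
```

Scala's `lazy val` gives the thread-safe build-once-on-first-access semantics this pattern needs; subsequent scans read the already-materialized buffers.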

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-23880

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21018
commit 01d75d789c45f73bd999106dfc6f29cdc3050ce9
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-04-09T09:30:10Z



