GitHub user maropu opened a pull request:

    [SPARK-23880][SQL] Do not trigger any jobs for caching data

    ## What changes were proposed in this pull request?
    This pr fixed the code so that `cache` could prevent any jobs from being triggered. For example, in the current master, the operation below triggers an actual job:

        val df = spark.range(10000000000L)
          .filter('id > 1000)
          .orderBy('id.desc)
          .cache()
    This triggers a job even though the cache should be lazy. The problem is that, when creating `InMemoryRelation`, we build the RDD, which calls `SparkPlan.execute` and may trigger jobs, such as a sampling job for a range partitioner or a broadcast job.
    With this fix, the `RDD` is no longer built in the constructor of `InMemoryRelation`. Instead, `InMemoryTableScanExec` materializes the cache and updates the entry in the cache manager.
    ## How was this patch tested?
    Added tests in `CachedTableSuite`.
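The deferred-build idea described above can be sketched in plain Scala, with no Spark dependency. The names (`CachedRelationSketch`, `cachedBuffers`) are hypothetical, for illustration only: construction records *how* to build the cached data but runs nothing, and the first scan forces materialization, mirroring the lazy behavior the patch restores.

```scala
// Hypothetical sketch of lazy cache materialization (not Spark's actual API).
// The constructor stores a thunk; `lazy val` defers running it until first access.
class CachedRelationSketch(build: () => Seq[Long]) {
  // No job runs at construction time; the buffers are built on first scan.
  lazy val cachedBuffers: Seq[Long] = build()
}

object LazyCacheDemo {
  def main(args: Array[String]): Unit = {
    var jobsTriggered = 0
    // Creating the relation must not trigger the (simulated) job.
    val relation = new CachedRelationSketch(() => {
      jobsTriggered += 1
      (1L to 5L).filter(_ > 2L)
    })
    assert(jobsTriggered == 0, "construction should trigger no jobs")

    // The first scan materializes the cache, triggering exactly one job.
    val scanned = relation.cachedBuffers
    assert(jobsTriggered == 1, "first scan should trigger the job")
    assert(scanned == Seq(3L, 4L, 5L))

    // Later scans reuse the materialized buffers; no further jobs run.
    relation.cachedBuffers
    assert(jobsTriggered == 1, "subsequent scans should reuse the cache")
    println("ok")
  }
}
```

Scala's `lazy val` gives the thread-safe build-once-on-first-access semantics this pattern needs; subsequent scans read the already-materialized buffers.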

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-23880

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21018
commit 01d75d789c45f73bd999106dfc6f29cdc3050ce9
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-04-09T09:30:10Z



