Spark 1.5.2
dfOld.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
//
// do something
//
dfNew.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
Now when I use the "oldTableName" table I do get the latest contents
from dfNew, but does the old cached data from dfOld actually get cleared?
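For reference, if the intent is to replace the cached contents rather than layer a second entry on top, one option is to uncache explicitly before re-registering. This is a sketch against the public SQLContext API in 1.5.x, reusing `dfNew` and the table name from the snippet above:

```scala
// Drop the old cached entry first, then register and cache the new DataFrame.
// uncacheTable and cacheTable are public SQLContext methods in Spark 1.5.x.
sqlContext.uncacheTable("oldTableName")
dfNew.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
```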
CacheManager#cacheQuery() is called where:
* Caches the data produced by the logical representation of the given
[[Queryable]].
...
val planToCache = query.queryExecution.analyzed
if (lookupCachedData(planToCache).nonEmpty) {
Is the schema for dfNew different from that of dfOld ?
This method in CacheManager:
private[sql] def lookupCachedData(plan: LogicalPlan): Option[CachedData]
= readLock {
cachedData.find(cd => plan.sameResult(cd.plan))
led me to the following in
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
:
def sameResult(plan: LogicalPlan): Boolean
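To make the lookup behavior above concrete, here is a toy, Spark-free model of CacheManager's find-by-sameResult logic. `Plan`, `ToyCacheManager`, and the field comparison are illustrative stand-ins, not Spark's actual classes; the real `sameResult` compares analyzed logical plans, which a simple source-and-schema check only approximates:

```scala
// Stand-in for a logical plan: identified by its data source and schema.
case class Plan(source: String, schema: Seq[String]) {
  // Approximation of LogicalPlan.sameResult: two plans "match" only if
  // they read the same data with the same schema.
  def sameResult(other: Plan): Boolean =
    source == other.source && schema == other.schema
}

case class CachedData(plan: Plan)

class ToyCacheManager {
  private var cachedData: List[CachedData] = Nil

  // Mirrors CacheManager#cacheQuery: skip caching only if an existing
  // entry produces the same result; otherwise add a NEW entry.
  def cacheQuery(plan: Plan): Unit =
    if (lookupCachedData(plan).isEmpty) cachedData ::= CachedData(plan)

  // Mirrors lookupCachedData: linear scan using sameResult.
  def lookupCachedData(plan: Plan): Option[CachedData] =
    cachedData.find(cd => plan.sameResult(cd.plan))

  def size: Int = cachedData.size
}
```

The point of the model: registering a new DataFrame under an old table name does not evict the old entry, because the new plan does not `sameResult`-match the old one, so both end up cached — consistent with seeing two rows in the Storage UI.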
Thanks Ted!
Yes, the schema might be different or the same.
What would be the answer for each situation?
On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu wrote:
> CacheManager#cacheQuery() is called where:
> * Caches the data produced by the logical representation of the given
>
So I looked at the function; my only worry is that the cache should be
cleared if I'm overwriting the cache with the same table name. I ran this
experiment and the cache shows the table as not cached, but I want to confirm
this. In addition to not using the old table values, is the old cache actually
freed from memory?
When a second attempt is made to cache df3, which has the same schema as the
first DataFrame, you would see a warning:
scala> sqlContext.cacheTable("t1")
scala> sqlContext.isCached("t1")
res5: Boolean = true
scala> sqlContext.sql("select * from t1").show
+---+---+
| a| b|
+---+---+
| 1| 1|
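For completeness, a spark-shell sketch of the double-cache scenario, assuming a table t1 already registered as in the transcript above (the exact warning text is an assumption from 1.5.x's CacheManager and may differ):

```scala
scala> sqlContext.cacheTable("t1")
scala> sqlContext.cacheTable("t1")   // second attempt on the same plan
// expect a log line along the lines of:
// WARN CacheManager: Asked to cache already cached data.
```

Because the second plan is sameResult-equal to the first, no second cache entry is created; the duplicate-entry problem only arises when the new plan differs (e.g. a new DataFrame registered under the old name).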
From the UI I see two rows for this on a streaming application:

RDD Name                      | Storage Level                     | Cached Partitions | Fraction Cached | Size in Memory | Size in ExternalBlockStore | Size on Disk
In-memory table myColorsTable | Memory Deserialized 1x Replicated | 2                 | 100%            | 728.2 KB       | 0.0 B                      | 0.0 B
In-memory table myColorsTable | Memory