Sweet, that's probably it. Too bad it didn't seem to make 1.1?

On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust <mich...@databricks.com> wrote:
> The unknown slowdown might be addressed by
> https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
>
> On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>
>> SPARK-1671 looks really promising.
>>
>> Note that even right now, you don't need to un-cache the existing
>> table. You can do something like this:
>>
>> newAdditionRdd.registerTempTable("table2")
>> sqlContext.cacheTable("table2")
>> val unionedRdd =
>>   sqlContext.table("table1").unionAll(sqlContext.table("table2"))
>>
>> When you use "table", it returns the cached representation, so
>> the union executes much faster.
>>
>> However, there is some unknown slowdown; it's not quite as fast as
>> you would expect.
>>
>> On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>> > Ah, I see. So basically what you need is something like cache
>> > write-through support, which exists in Shark but is not implemented in
>> > Spark SQL yet. In Shark, when inserting data into a table that has
>> > already been cached, the newly inserted data is automatically cached
>> > and "union"-ed with the existing table content. SPARK-1671 was created
>> > to track this feature. We'll work on that.
>> >
>> > Currently, as a workaround, instead of doing the union at the RDD
>> > level, you may try caching the new table, unioning it with the old
>> > table, and then querying the union-ed table. The drawbacks are higher
>> > code complexity and that you end up with lots of temporary tables, but
>> > the performance should be reasonable.
>> >
>> > On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur
>> > <archit279tha...@gmail.com> wrote:
>> >>
>> >> Little code snippet:
>> >>
>> >> line1: cacheTable(existingRDDTableName)
>> >> line2: // some operations which will materialize the existingRDD dataset.
>> >> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
>> >> line4: cacheTable(new_existingRDDTableName)
>> >> line5: // some operation that will materialize the new existingRDD.
>> >>
>> >> Now, what we expect is that in line4, rather than caching both
>> >> existingRDDTableName and new_existingRDDTableName, it caches only
>> >> new_existingRDDTableName. But we cannot explicitly uncache
>> >> existingRDDTableName, because we want the union to use the cached
>> >> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
>> >> could be materialized later, and until then we can't lose
>> >> existingRDDTableName from the cache.
>> >>
>> >> What if we keep the same name for the new table?
>> >>
>> >> So: cacheTable(existingRDDTableName)
>> >> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
>> >> cacheTable(existingRDDTableName) // might not be needed again.
>> >>
>> >> Would that satisfy both our cases: the union uses existingRDDTableName
>> >> from the cache, and the data is not duplicated in the cache but is
>> >> somehow appended to the older cached table?
>> >>
>> >> Thanks and Regards,
>> >>
>> >> Archit Thakur.
>> >> Sr Software Developer,
>> >> Guavus, Inc.
>> >>
>> >> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
>> >> <pankajarora.n...@gmail.com> wrote:
>> >>>
>> >>> I think I should elaborate on the use case a little more.
>> >>>
>> >>> We have a UI dashboard whose response time is quite fast because all
>> >>> the data is cached. Users query data based on a time range, and there
>> >>> is always new data coming into the system at a predefined frequency,
>> >>> let's say 1 hour.
>> >>>
>> >>> As you said, I can uncache tables, but that will basically drop all
>> >>> data from memory. I cannot afford to lose my cache even for a short
>> >>> interval, as all queries from the UI will be slow until the cache
>> >>> loads again.
>> >>> UI response time needs to be predictable and should be fast enough
>> >>> that the user does not get irritated.
>> >>>
>> >>> Also, I cannot keep two copies of the data (until the new RDD
>> >>> materializes) in memory, as that would surpass the total available
>> >>> memory in the system.
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
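
[Editor's note] Evan's cached-union workaround can be sketched end-to-end. This is a sketch against the Spark 1.1-era API (`SQLContext`, `registerTempTable`, `cacheTable`, `unionAll`) and assumes a running SparkContext; the `Event` case class and the table names are illustrative, not from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative schema for the cached table.
case class Event(ts: Long, value: Int)

object CachedUnionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cached-union"))
    val sqlContext = new SQLContext(sc)
    // Enables the implicit RDD[Product] -> SchemaRDD conversion (Spark 1.1).
    import sqlContext.createSchemaRDD

    // Existing data: register, cache, and run a query to materialize the cache.
    val existing = sc.parallelize(Seq(Event(1L, 10), Event(2L, 20)))
    existing.registerTempTable("table1")
    sqlContext.cacheTable("table1")
    sqlContext.sql("SELECT COUNT(*) FROM table1").collect()

    // A new batch arrives: cache it as its own table instead of
    // uncaching and rebuilding table1.
    val newBatch = sc.parallelize(Seq(Event(3L, 30)))
    newBatch.registerTempTable("table2")
    sqlContext.cacheTable("table2")

    // sqlContext.table(...) resolves to the cached representation, so the
    // union reads from the in-memory columnar store instead of recomputing.
    val unioned = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
    unioned.registerTempTable("combined")
    sqlContext.sql("SELECT COUNT(*) FROM combined").collect()
  }
}
```

The cost Cheng mentions shows up over time: each hourly batch adds one more temporary table to the union, so this works best with a periodic compaction of the small tables back into one.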
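
[Editor's note] Pankaj's constraint (never serve from a cold cache) can be met with a swap-based refresh. This is a sketch, not an API the thread confirms; it assumes a `sqlContext` whose `"table1"` is already cached and a `newBatch` SchemaRDD of new rows (names illustrative), on the same Spark 1.1-era API.

```scala
// 1. Cache the new batch and register the union under a new name. The
//    union still reads "table1" from the in-memory cache while building.
newBatch.registerTempTable("newBatchTable")
sqlContext.cacheTable("newBatchTable")
val refreshed =
  sqlContext.table("table1").unionAll(sqlContext.table("newBatchTable"))
refreshed.registerTempTable("table1_v2")
sqlContext.cacheTable("table1_v2")

// 2. Force materialization so the new cache is fully populated before
//    the old one is dropped; UI queries keep hitting "table1" meanwhile.
sqlContext.sql("SELECT COUNT(*) FROM table1_v2").collect()

// 3. Only now release the old cache and point queries at the new table.
//    Note the trade-off raised in the thread: during the swap, memory
//    briefly holds both copies, so this only helps if that peak fits.
sqlContext.uncacheTable("table1")
```

This keeps the dashboard's response time predictable at the price of a short double-memory peak, which is exactly the constraint Pankaj says may not be affordable; SPARK-1671's write-through caching is the cleaner fix.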