Sweet, that's probably it. Too bad it didn't seem to make 1.1?

On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust <mich...@databricks.com> wrote:
> The unknown slowdown might be addressed by
> https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
>
> On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>
>> SPARK-1671 looks really promising.
>>
>> Note that even right now, you don't need to un-cache the existing
>> table. You can do something like this:
>>
>> newAdditionRdd.registerTempTable("table2")
>> sqlContext.cacheTable("table2")
>> val unionedRdd =
>>   sqlContext.table("table1").unionAll(sqlContext.table("table2"))
>>
>> When you use "table", it returns the cached representation, so
>> the union executes much faster.
>>
>> However, there is some unknown slowdown; it's not quite as fast as
>> you would expect.
>>
>> On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>> > Ah, I see. So basically what you need is something like cache
>> > write-through support, which exists in Shark but is not implemented in
>> > Spark SQL yet. In Shark, when inserting data into a table that has
>> > already been cached, the newly inserted data is automatically cached
>> > and "union"-ed with the existing table content. SPARK-1671 was created
>> > to track this feature. We'll work on that.
>> >
>> > Currently, as a workaround, instead of doing the union at the RDD
>> > level, you may try caching the new table, unioning it with the old
>> > table, and then querying the union-ed table. The drawbacks are higher
>> > code complexity and that you end up with lots of temporary tables, but
>> > the performance should be reasonable.
>> >
>> > On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur
>> > <archit279tha...@gmail.com> wrote:
>> >>
>> >> Little code snippet:
>> >>
>> >> line1: cacheTable(existingRDDTableName)
>> >> line2: // some operations which will materialize the existingRDD dataset.
>> >> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
>> >> line4: cacheTable(new_existingRDDTableName)
>> >> line5: // some operation that will materialize the new existingRDD.
>> >>
>> >> Now, what we expect is that in line4, rather than caching both
>> >> existingRDDTableName and new_existingRDDTableName, it caches only
>> >> new_existingRDDTableName. But we cannot explicitly uncache
>> >> existingRDDTableName, because we want the union to use the cached
>> >> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
>> >> could be materialized later, and until then we can't lose
>> >> existingRDDTableName from the cache.
>> >>
>> >> What if we keep the same name for the new table?
>> >>
>> >> So: cacheTable(existingRDDTableName)
>> >> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
>> >> cacheTable(existingRDDTableName) // might not be needed again.
>> >>
>> >> Would that satisfy both our cases: the union uses existingRDDTableName
>> >> from the cache, and the data is not duplicated in the cache but is
>> >> somehow appended to the older cached table?
>> >>
>> >> Thanks and Regards,
>> >>
>> >> Archit Thakur.
>> >> Sr Software Developer,
>> >> Guavus, Inc.
>> >>
>> >> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
>> >> <pankajarora.n...@gmail.com> wrote:
>> >>>
>> >>> I think I should elaborate on the use case a little more.
>> >>>
>> >>> We have a UI dashboard whose response time is quite fast because all
>> >>> the data is cached. Users query data based on a time range, and there
>> >>> is always new data coming into the system at a predefined frequency,
>> >>> let's say 1 hour.
>> >>>
>> >>> As you said, I can uncache tables, but that will basically drop all
>> >>> data from memory. I cannot afford to lose my cache even for a short
>> >>> interval, as all queries from the UI will be slow until the cache
>> >>> loads again.
>> >>> UI response time needs to be predictable and should be fast enough
>> >>> that the user does not get irritated.
>> >>>
>> >>> Also, I cannot keep two copies of the data (until the new RDD
>> >>> materializes) in memory, as that would surpass the total available
>> >>> memory in the system.
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
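
[Editor's note] Evan's cached-union workaround can be sketched end-to-end. This is a sketch against the Spark 1.1-era API (`SQLContext`, `registerTempTable`, `cacheTable`, `unionAll`) and assumes a running SparkContext; the `Event` case class and the table names are illustrative, not from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative schema for the cached table.
case class Event(ts: Long, value: Int)

object CachedUnionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cached-union"))
    val sqlContext = new SQLContext(sc)
    // Enables the implicit RDD[Product] -> SchemaRDD conversion (Spark 1.1).
    import sqlContext.createSchemaRDD

    // Existing data: register, cache, and run a query to materialize the cache.
    val existing = sc.parallelize(Seq(Event(1L, 10), Event(2L, 20)))
    existing.registerTempTable("table1")
    sqlContext.cacheTable("table1")
    sqlContext.sql("SELECT COUNT(*) FROM table1").collect()

    // A new batch arrives: cache it as its own table instead of
    // uncaching and rebuilding table1.
    val newBatch = sc.parallelize(Seq(Event(3L, 30)))
    newBatch.registerTempTable("table2")
    sqlContext.cacheTable("table2")

    // sqlContext.table(...) resolves to the cached representation, so the
    // union reads from the in-memory columnar store instead of recomputing.
    val unioned = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
    unioned.registerTempTable("combined")
    sqlContext.sql("SELECT COUNT(*) FROM combined").collect()
  }
}
```

The cost Cheng mentions shows up over time: each hourly batch adds one more temporary table to the union, so this works best with a periodic compaction of the small tables back into one.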
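
[Editor's note] Pankaj's constraint (never serve from a cold cache) can be met with a swap-based refresh. This is a sketch, not an API the thread confirms; it assumes a `sqlContext` whose `"table1"` is already cached and a `newBatch` SchemaRDD of new rows (names illustrative), on the same Spark 1.1-era API.

```scala
// 1. Cache the new batch and register the union under a new name. The
//    union still reads "table1" from the in-memory cache while building.
newBatch.registerTempTable("newBatchTable")
sqlContext.cacheTable("newBatchTable")
val refreshed =
  sqlContext.table("table1").unionAll(sqlContext.table("newBatchTable"))
refreshed.registerTempTable("table1_v2")
sqlContext.cacheTable("table1_v2")

// 2. Force materialization so the new cache is fully populated before
//    the old one is dropped; UI queries keep hitting "table1" meanwhile.
sqlContext.sql("SELECT COUNT(*) FROM table1_v2").collect()

// 3. Only now release the old cache and point queries at the new table.
//    Note the trade-off raised in the thread: during the swap, memory
//    briefly holds both copies, so this only helps if that peak fits.
sqlContext.uncacheTable("table1")
```

This keeps the dashboard's response time predictable at the price of a short double-memory peak, which is exactly the constraint Pankaj says may not be affordable; SPARK-1671's write-through caching is the cleaner fix.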