Ian, I tried playing with your suggestion, but I get a "task not
serializable" error (and the obvious fixes didn't help).  Can you get
that working?


On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek <minnesota...@gmail.com> wrote:

> As to your last line: I've used RDD zipping to avoid GC since MyBaseData
> is large and doesn't change.  I think this is a very good solution to what
> is being asked for.
>
>
> On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell <i...@ianoconnell.com>wrote:
>
>> A mutable map in an object should do what you're looking for then, I
>> believe. You just reference the object as an object in your closure, so it
>> won't be swept up when your closure is serialized, and you can then reference
>> the object's variables on the remote host. e.g.:
>>
>> object MyObject {
>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>> }
>>
>> rdd.map { ele =>
>>   MyObject.mmap.getOrElseUpdate(ele, 1L)
>>   ...
>> }.map { ele =>
>>   require(MyObject.mmap(ele) == 1L)
>>   ...
>> }.count
>>
>> Along with the possibility of data loss, just be careful with thread
>> safety: multiple threads/partitions can run on one host, so the map should
>> be viewed as shared state across all of them.
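A minimal, self-contained sketch of that thread-safety point, using `scala.collection.concurrent.TrieMap` from the standard library in place of the plain mutable `Map` (plain threads stand in for concurrent Spark task threads; `WorkerCache` and the key range are hypothetical names for illustration):

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical singleton: a Scala object is initialized once per JVM, so on
// an executor its fields outlive any single task that references it.
object WorkerCache {
  // TrieMap is a lock-free concurrent map from the Scala standard library,
  // safe when several task threads on the same host update it at once.
  val mmap = TrieMap.empty[Long, Long]
}

object ThreadSafetyDemo {
  def main(args: Array[String]): Unit = {
    // Plain threads stand in for concurrent partition tasks on one host.
    val threads = (1 to 4).map { _ =>
      new Thread(new Runnable {
        def run(): Unit =
          (0L until 1000L).foreach(k => WorkerCache.mmap.getOrElseUpdate(k, 1L))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(WorkerCache.mmap.size) // 1000 distinct keys, no lost updates
  }
}
```

With a plain `mutable.Map` the same race can corrupt the map's internal state; the concurrent map keeps each lookup-or-insert safe.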
>>
>>
>>
>> Also, from your exact description it sounds like your data should be
>> encoded into the RDD if it's per-record/per-row:  RDD[(MyBaseData,
>> LastIterationSideValues)]
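A sketch of that pair encoding, with a plain `Seq` standing in for the RDD and `MyBaseData` / `LastIterationSideValues` stubbed as hypothetical case classes (the real definitions would differ):

```scala
// Stubs for the types mentioned above; only the pairing pattern matters here.
case class MyBaseData(features: Vector[Double])
case class LastIterationSideValues(score: Double)

object PairEncodingDemo {
  // One iteration: the side value rides along with its record, so no
  // worker-local state needs to survive between operations.
  def step(data: Seq[(MyBaseData, LastIterationSideValues)])
      : Seq[(MyBaseData, LastIterationSideValues)] =
    data.map { case (base, side) =>
      (base, LastIterationSideValues(side.score + base.features.sum))
    }

  def main(args: Array[String]): Unit = {
    val init = Seq((MyBaseData(Vector(1.0, 2.0)), LastIterationSideValues(0.0)))
    val next = step(step(init))
    println(next.head._2.score) // 6.0 after two iterations
  }
}
```

In Spark the same `step` shape would be an `rdd.map` over the pair RDD, with the large `MyBaseData` half cached and reused across iterations.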
>>
>>
>>
>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> In our case, we'd like to keep memory content from one iteration to the
>>> next, and not just during a single mapPartition call because then we can do
>>> more efficient computations using the values from the previous iteration.
>>>
>>> So essentially, we need to declare objects outside the scope of the
>>> map/reduce calls (but residing in individual workers), so that they can be
>>> accessed from within the map/reduce calls.
>>>
>>> We'd be making some assumptions, as you said, such as: an RDD partition
>>> is statically located and can't move from one worker to another unless
>>> the worker crashes.
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>> coded...@cs.stanford.edu> wrote:
>>>>
>>>>> Actually, I do not know how to do something like this, or whether it
>>>>> is possible - hence my tentative suggestion.
>>>>>
>>>>> Can you already declare persistent in-memory objects per worker? I
>>>>> tried something like constructing a singleton object within map
>>>>> functions, but that didn't work: it seemed to actually serialize the
>>>>> singleton and pass it back and forth in a weird manner.
>>>>>
>>>>>
>>>> Does it need to be persistent across operations, or just persist for
>>>> the lifetime of processing of one partition in one mapPartition? The latter
>>>> is quite easy and might give most of the speedup.
>>>>
>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>> times in a repeated iterative computation. It would certainly avoid
>>>> managing a lot of complexity in trying to keep that state alive remotely
>>>> across operations. I'd also be interested if there is any reliable way to
>>>> do that, though it seems hard since it means you embed assumptions about
>>>> where particular data is going to be processed.
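The per-partition version described above can be sketched with a plain `Iterator` standing in for one partition's data; in Spark this function would be the argument to `rdd.mapPartitions` (the doubling computation is a made-up stand-in for an expensive lookup):

```scala
object PerPartitionDemo {
  // State created here lives only while one partition is processed:
  // built once per partition, reused for every element, then discarded.
  def processPartition(it: Iterator[Long]): Iterator[Long] = {
    val cache = scala.collection.mutable.Map.empty[Long, Long]
    it.map(k => cache.getOrElseUpdate(k, k * 2))
  }

  def main(args: Array[String]): Unit = {
    // One "partition" of data; the repeated key 1L hits the cache.
    println(processPartition(Iterator(1L, 2L, 1L, 3L)).toList)
  }
}
```

The cache is rebuilt each time the partition is reprocessed, which is exactly the "re-cache values several times" cost mentioned above, in exchange for not having to pin state to a particular worker.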
>>>>
>>>>
>>>
>>
>
