Right---They are zipped at each iteration.

On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen <chesterxgc...@yahoo.com> wrote:

> Tom,
>     Are you suggesting two RDDs, one with the losses and another with the rest
> of the info, using zip to tie them together, but doing the update on the loss RDD (a copy)?
>
> Chester
>
> Sent from my iPhone
>
> On Apr 28, 2014, at 9:45 AM, Tom Vacek <minnesota...@gmail.com> wrote:
>
> I'm not sure what I said came through.  RDD zip is not hacky at all, as it
> only depends on the user not changing the partitioning.  Basically, you would
> keep your losses as an RDD[Double], zip those with the RDD of examples,
> and update the losses.  You're doing a copy (and GC) on the RDD of losses
> each time, but this is negligible.
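>
> A rough sketch of that pattern (computeLoss, examples and numIters are just
> placeholders; it assumes the examples RDD keeps the same partitioning across
> iterations):
>
> var losses = examples.map(_ => 0.0)          // initial losses, one per example
> for (iter <- 1 to numIters) {
>   // zip pairs elements up partition by partition; both RDDs must have the
>   // same partitioning and the same number of elements per partition
>   val updated = examples.zip(losses).map { case (ex, loss) =>
>     computeLoss(ex, loss)
>   }
>   updated.persist()
>   updated.count()                            // materialize before dropping the old losses
>   losses.unpersist()
>   losses = updated
> }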
>
>
> On Mon, Apr 28, 2014 at 11:33 AM, Sung Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Yes, this is what we've done as of now (if you read the earlier threads). And
>> we were saying that we'd prefer it if Spark supported persistent worker memory
>> management in a slightly less hacky way ;)
>>
>>
>> On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <i...@ianoconnell.com> wrote:
>>
>>> A mutable map in an object should do what you're looking for then, I
>>> believe. You just reference the singleton object inside your closure, so it
>>> won't be swept up when the closure is serialized, and you can then access the
>>> object's members on the remote host. e.g.:
>>>
>>> // a top-level singleton lives once per executor JVM and is never serialized
>>> object MyObject {
>>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>>> }
>>>
>>> rdd.map { ele =>
>>>   MyObject.mmap.getOrElseUpdate(ele, 1L)   // populate the per-executor map
>>>   ...
>>>   ele                                      // pass the element through
>>> }.map { ele =>
>>>   require(MyObject.mmap(ele) == 1L)        // read it back later on the same executor
>>> }.count
>>>
>>> Along with the data-loss caveat, just be careful with thread safety: multiple
>>> threads/partitions can run on one host, so the map should be viewed as state
>>> shared across a larger scope than a single partition.
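>>>
>>> If that's a concern, a thread-safe map sidesteps it; a minimal variation on
>>> the sketch above (same idea, just concurrent):
>>>
>>> import scala.collection.concurrent.TrieMap
>>>
>>> object MyObject {
>>>   // TrieMap is safe for several task threads updating it at the same time
>>>   val mmap = TrieMap[Long, Long]()
>>> }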
>>>
>>>
>>>
>>> Also, with your exact description it sounds like your data should be
>>> encoded into the RDD if it's per-record/per-row:  RDD[(MyBaseData,
>>> LastIterationSideValues)]
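>>>
>>> Roughly along these lines (updateSideValues, initialData and numIterations
>>> are placeholders for whatever your algorithm does per row):
>>>
>>> var data: RDD[(MyBaseData, LastIterationSideValues)] = initialData
>>> for (i <- 1 to numIterations) {
>>>   data = data.map { case (row, side) =>
>>>     (row, updateSideValues(row, side))   // carry the new side values forward
>>>   }
>>>   data.cache()
>>> }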
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>>> coded...@cs.stanford.edu> wrote:
>>>
>>>> In our case, we'd like to keep memory content from one iteration to the
>>>> next, and not just during a single mapPartitions call, because then we can do
>>>> more efficient computations using the values from the previous iteration.
>>>>
>>>> So essentially, we need to declare objects outside the scope of the
>>>> map/reduce calls (but residing in individual workers), so that they can be
>>>> accessed from within the map/reduce calls.
>>>>
>>>> We'd be making some assumptions, as you said, such as: an RDD partition is
>>>> statically located and can't move from one worker to another unless the
>>>> worker crashes.
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>>> coded...@cs.stanford.edu> wrote:
>>>>>
>>>>>> Actually, I do not know how to do something like this, or whether it is
>>>>>> possible at all - hence my tentative suggestion.
>>>>>>
>>>>>> Can you already declare persistent in-memory objects per worker? I tried
>>>>>> something like constructing a singleton object within map functions, but
>>>>>> that didn't work, as it seemed to actually serialize the singletons and pass
>>>>>> them back and forth in a weird manner.
>>>>>>
>>>>>>
>>>>> Does it need to be persistent across operations, or just persist for the
>>>>> lifetime of processing of one partition in one mapPartitions call? The latter
>>>>> is quite easy and might give most of the speedup.
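>>>>>
>>>>> For the latter, something like this (expensiveComputation is just a
>>>>> placeholder; the map lives only while that partition is being processed):
>>>>>
>>>>> rdd.mapPartitions { iter =>
>>>>>   // built once per partition per pass, garbage-collected afterwards
>>>>>   val cache = scala.collection.mutable.Map[Long, Double]()
>>>>>   iter.map(ele => cache.getOrElseUpdate(ele, expensiveComputation(ele)))
>>>>> }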
>>>>>
>>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>>> times in a repeated iterative computation. It would certainly avoid a lot of
>>>>> the complexity of trying to keep that state alive remotely across operations.
>>>>> I'd also be interested to know whether there is any reliable way to do that,
>>>>> though it seems hard, since it means embedding assumptions about where
>>>>> particular data is going to be processed.
>>>>>
>>>>>
>>>>
>>>
>>
>
