Ian, I tried playing with your suggestion, but I get a task not serializable error (and some obvious things didn't fix it). Can you get that working?
On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek <minnesota...@gmail.com> wrote: > As to your last line: I've used RDD zipping to avoid GC since MyBaseData > is large and doesn't change. I think this is a very good solution to what > is being asked for. > > > On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell <i...@ianoconnell.com>wrote: > >> A mutable map in an object should do what your looking for then I >> believe. You just reference the object as an object in your closure so it >> won't be swept up when your closure is serialized and you can reference >> variables of the object on the remote host then. e.g.: >> >> object MyObject { >> val mmap = scala.collection.mutable.Map[Long, Long]() >> } >> >> rdd.map { ele => >> MyObject.mmap.getOrElseUpdate(ele, 1L) >> ... >> }.map {ele => >> require(MyObject.mmap(ele) == 1L) >> >> }.count >> >> Along with the data loss just be careful with thread safety and multiple >> threads/partitions on one host so the map should be viewed as shared >> amongst a larger space. >> >> >> >> Also with your exact description it sounds like your data should be >> encoded into the RDD if its per-record/per-row: RDD[(MyBaseData, >> LastIterationSideValues)] >> >> >> >> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung < >> coded...@cs.stanford.edu> wrote: >> >>> In our case, we'd like to keep memory content from one iteration to the >>> next, and not just during a single mapPartition call because then we can do >>> more efficient computations using the values from the previous iteration. >>> >>> So essentially, we need to declare objects outside the scope of the >>> map/reduce calls (but residing in individual workers), then those can be >>> accessed from the map/reduce calls. >>> >>> We'd be making some assumptions as you said, such as - RDD partition is >>> statically located and can't move from worker to another worker unless the >>> worker crashes. >>> >>> >>> >>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote: >>> >>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung < >>>> coded...@cs.stanford.edu> wrote: >>>> >>>>> Actually, I do not know how to do something like this or whether this >>>>> is possible - thus my suggestive statement. >>>>> >>>>> Can you already declare persistent memory objects per worker? I tried >>>>> something like constructing a singleton object within map functions, but >>>>> that didn't work as it seemed to actually serialize singletons and pass it >>>>> back and forth in a weird manner. >>>>> >>>>> >>>> Does it need to be persistent across operations, or just persist for >>>> the lifetime of processing of one partition in one mapPartition? The latter >>>> is quite easy and might give most of the speedup. >>>> >>>> Maybe that's 'enough', even if it means you re-cache values several >>>> times in a repeated iterative computation. It would certainly avoid >>>> managing a lot of complexity in trying to keep that state alive remotely >>>> across operations. I'd also be interested if there is any reliable way to >>>> do that, though it seems hard since it means you embed assumptions about >>>> where particular data is going to be processed. >>>> >>>> >>> >> >