Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sean Owen
On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: e.g. something like

rdd.mapPartition((rows: Iterator[String]) => {
  var idx = 0
  rows.map((row: String) => {
    val valueMap = SparkWorker.getMemoryContent(valMap)
    val prevVal = valueMap(idx)
    idx +=
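The quoted snippet cuts off mid-expression, but it appears to sketch per-partition mutable state: a counter that indexes each row into some worker-local value map. A minimal runnable sketch of that pattern, shown with a plain Scala Iterator (in Spark the same body would run inside rdd.mapPartitions; SparkWorker.getMemoryContent in the quote is not a real Spark API, so the external state is simulated here with a Vector):

```scala
// Per-partition index pattern from the quoted snippet: walk the rows
// with a local counter and look up per-row state by position.
object PartitionIndexSketch {
  def withIndex(rows: Iterator[String], valueMap: Vector[Double]): Iterator[(String, Double)] = {
    var idx = 0
    rows.map { row =>
      val prevVal = valueMap(idx) // state stored for this row's position
      idx += 1                    // advance to the next row's slot
      (row, prevVal)
    }
  }
}
```

Note that the counter only works if the rows are traversed in a stable order within the partition, which is the assumption the whole approach rests on.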

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
Actually, I do not know how to do something like this, or whether it is possible at all; hence my tentative suggestion. Can you already declare persistent in-memory objects per worker? I tried constructing a singleton object within map functions, but that didn't work, as it seemed to

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Ian O'Connell
A mutable map in an object should do what you're looking for, then, I believe. You just reference the object as an object in your closure, so it won't be swept up when your closure is serialized, and you can then reference the object's variables on the remote host. e.g.: object MyObject { val mmap =
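The example is truncated at the archive boundary; a completed sketch of the pattern being described might look like the following. A Scala object is initialized lazily once per JVM, so each Spark executor gets its own independent copy; a closure that refers to it serializes only the class reference, not the map's contents:

```scala
import scala.collection.mutable

// Worker-local state: one instance of this map per JVM (per executor).
// Closures that touch MyObject.mmap do not drag the map through
// serialization; each executor lazily initializes its own copy.
object MyObject {
  val mmap = mutable.Map[Int, Double]()
}
```

A caveat the thread circles around: because each executor has its own copy, writes on one worker are invisible to the others and to the driver, so this only works for state that is purely local to a worker.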

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is large and doesn't change. I think this is a very good solution to what is being asked for. On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote: A mutable map in an object should do what your

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
I'm not sure my earlier point came through. RDD zip is not hacky at all, as it only depends on the user not changing the partitioning. Basically, you would keep your losses as an RDD[Double] and zip those with the RDD of examples, then update the losses. You're doing a copy (and GC) on the RDD of
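The zip-and-update idea can be sketched with plain Scala collections; in Spark the same shape applies to examples.zip(losses) on two RDDs with identical partitioning, where only the small Double column is rewritten each iteration while the large example data is never copied. The Example case class and the update rule below are hypothetical stand-ins, not from the thread:

```scala
// One training example: large, immutable payload.
case class Example(features: Array[Double], label: Double)

object ZipSketch {
  // One iteration: pair each example with its current loss and produce
  // the updated losses. Only the Double column is reallocated; the
  // examples collection is untouched, which is what avoids the GC cost.
  def step(examples: Seq[Example], losses: Seq[Double]): Seq[Double] =
    examples.zip(losses).map { case (ex, oldLoss) =>
      0.5 * oldLoss + 0.5 * ex.label // stand-in update rule
    }
}
```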

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Right, they are zipped at each iteration. On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen chesterxgc...@yahoo.com wrote: Tom, Are you suggesting two RDDs, one with the loss and another with the rest of the info, using zip to tie them together, but doing the update on the loss RDD (copy)? Chester Sent from

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Ian, I tried playing with your suggestion, but I get a "task not serializable" error (and some obvious things didn't fix it). Can you get that working? On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote: As to your last line: I've used RDD zipping to avoid GC since

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
That might be a good alternative to what we are looking for, but I wonder whether it would be as efficient as we want. For instance, will RDDs of the same size usually get partitioned to the same machines, thus not triggering any cross-machine alignment, etc.? We'll explore it, but I would still

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
If you create your auxiliary RDD as a map from the examples, the partitioning will be inherited. On Mon, Apr 28, 2014 at 12:38 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: That might be a good alternative to what we are looking for. But I wonder if this would be as efficient as we want
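The point about inheritance can be illustrated with plain collections: deriving the auxiliary sequence from the examples themselves guarantees element-for-element alignment, which is the analogue of a narrow RDD map preserving its parent's partitioning, so the subsequent zip needs no shuffle. The initialLoss function is a hypothetical initializer:

```scala
object AlignSketch {
  def initialLoss(x: Double): Double = 0.0 // hypothetical initializer

  // Derive the loss column from the examples; because it is a direct
  // map, it has the same length and order (in Spark: the same
  // partitioning), so zip lines up without any data movement.
  def build(examples: Vector[Double]): Vector[(Double, Double)] = {
    val losses = examples.map(initialLoss)
    examples.zip(losses)
  }
}
```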

Re: is it okay to reuse objects across RDD's?

2014-04-27 Thread DB Tsai
Hi Todd, As Patrick and you already pointed out, it's really dangerous to mutate the state of an RDD. However, when we implemented glmnet in Spark, we found that if we can reuse the residuals for each row of the RDD computed in the previous step, it can speed things up 4~5x. As a result, we add an extra column in the RDD for
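The message is cut off, but the described pattern of carrying a residual column alongside each row can be sketched as follows. Each iteration produces new immutable rows whose residuals build on the previous step's values instead of being recomputed from scratch (the Row shape and the update rule are hypothetical stand-ins, not the actual glmnet code):

```scala
// A row augmented with an extra residual column, carried across iterations.
case class Row(features: Array[Double], label: Double, residual: Double)

object ResidualSketch {
  // One iteration: reuse each row's stored residual and adjust it by the
  // step's delta, rather than recomputing the residual from the features.
  def update(rows: Seq[Row], delta: Double): Seq[Row] =
    rows.map(r => r.copy(residual = r.residual - delta))
}
```

Because `copy` produces new Row objects rather than mutating the old ones, this stays within the "don't mutate RDD contents" rule while still reusing the previous step's numbers.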