As you suggested, I tried saving the grouped RDD and persisting it in
memory before the iterations begin. Performance is much better now.
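For reference, the restructuring can be sketched without a Spark cluster, using plain Scala collections to stand in for the RDD (the object name, the toy edge list, and the size-counting stand-in for F are illustrative assumptions, not the actual job):

```scala
// Spark-free sketch of "group once, iterate many times". With Spark, the
// same move plus persist() keeps the grouped data cached across iterations.
object GroupOnceDemo {
  var groupByCalls = 0 // counts how many times the expensive grouping runs

  // Toy graph edges as (source, destination) pairs.
  val edges = List((1, 2), (1, 3), (2, 3), (3, 1))

  def groupBySource(es: List[(Int, Int)]): Map[Int, List[(Int, Int)]] = {
    groupByCalls += 1
    es.groupBy { case (src, _) => src }
  }

  def main(args: Array[String]): Unit = {
    // Group once, outside the loop -- the analogue of grouped.persist().
    val grouped = groupBySource(edges)

    var iteration = 0
    var done = false
    while (!done) { // loop exit via a flag, since Scala has no `break` keyword
      val res = grouped.map { case (src, es) => (src, es.size) } // stand-in for flatMap(F)
      iteration += 1
      done = iteration >= 3 // stand-in for `criteria`
    }

    println(s"groupBy ran $groupByCalls time(s) over $iteration iterations")
  }
}
```

Running it shows the grouping is computed once even though the loop body executes three times, which is exactly the cost the original snippet was paying on every iteration.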

My previous comment that the run times doubled was based on a mistaken observation.

Thanks.


On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan <kvi...@vt.edu>
wrote:

> Thanks.
>
> I tried persist() on the RDD. The runtimes appear to have doubled now
> (without persist() it was ~7s per iteration; now it's ~15s). I am running
> standalone Spark on an 8-core machine.
> Any thoughts on why the runtime increased?
>
> On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid <iras...@cloudera.com>
> wrote:
>
>>
>> val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER)
>> // or whatever persistence makes more sense for you ...
>> var done = false
>> while (!done) {
>>   val res = grouped.flatMap(F)
>>   res.collect.foreach(func)
>>   if (criteria) done = true
>> }
>>
>> On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan <kvi...@vt.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> I have the following use case.
>>>
>>> (1) I have an RDD of edges of a graph (say R).
>>> (2) do a groupBy on R (by say source vertex) and call a function F on
>>> each group.
>>> (3) collect the results from Fs and do some computation
>>> (4) repeat the above steps until some criterion is met
>>>
>>> In (2), the groups are always going to be the same (since R is grouped
>>> by source vertex).
>>>
>>> Question:
>>> Is R distributed every iteration (when in (2)) or is it distributed only
>>> once when it is created?
>>>
>>> A sample code snippet is below.
>>>
>>> var done = false
>>> while (!done) {
>>>   val res = R.groupBy[VertexId](G).flatMap(F)
>>>   res.collect.foreach(func)
>>>   if (criteria) done = true
>>> }
>>>
>>> Since the groups remain the same, what is the best way to go about
>>> implementing the above logic?
>>>
>>
>>
>
