I don't think .aggregateByKey() can be used directly, because we also need the count per key (for the average). Maybe we can combine .countByKey(), which returns a map, with .foldByKey(0)(_ + _) (or aggregateByKey()), which gives the sum of values per key. I'm not sure myself how to proceed.
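On second thought, aggregateByKey might be able to do it in one pass if the accumulator carries the count along with the sums, e.g. (count, sum, sumOfSquares). Here is a rough sketch (not tested against a real cluster): the seqOp/combOp pair is what you would hand to rdd.aggregateByKey(zero)(seqOp, combOp), but below they are emulated on plain Scala collections so the snippet runs standalone. The object and helper names are mine, not from any Spark API.

```scala
object PerKeyStats {
  // Accumulator: (count, sum, sum of squares)
  type Acc = (Long, Double, Double)

  val zero: Acc = (0L, 0.0, 0.0)

  // seqOp: fold one value into the accumulator (per partition in Spark)
  def seqOp(acc: Acc, v: Double): Acc =
    (acc._1 + 1, acc._2 + v, acc._3 + v * v)

  // combOp: merge two partial accumulators (across partitions in Spark)
  def combOp(a: Acc, b: Acc): Acc =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  // Turn an accumulator into (mean, population standard deviation)
  def stats(acc: Acc): (Double, Double) = {
    val (n, sum, sumSq) = acc
    val mean = sum / n
    val variance = sumSq / n - mean * mean
    // clamp tiny negative values caused by floating-point rounding
    (mean, math.sqrt(math.max(variance, 0.0)))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("k1", 1.0), ("k1", 2.0), ("k1", 3.0),
                   ("k2", 4.0), ("k2", 4.0), ("k2", 4.0))
    // Emulates rdd.aggregateByKey(zero)(seqOp, combOp).mapValues(stats)
    val perKey = data.groupBy(_._1).map { case (k, kvs) =>
      k -> stats(kvs.map(_._2).foldLeft(zero)(seqOp))
    }
    perKey.toSeq.sortBy(_._1).foreach { case (k, (m, sd)) =>
      println(f"$k%s mean=$m%.4f stddev=$sd%.4f")
    }
  }
}
```

Unlike groupByKey, this never materializes all values of a key in memory; each partition reduces its values to one small triple first, and only the triples are shuffled.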
Regards

On Fri, Oct 31, 2014 at 1:26 PM, qinwei <wei....@dewmobile.net> wrote:
> Hi, everyone
> I have an RDD filled with data like
> (k1, v11)
> (k1, v12)
> (k1, v13)
> (k2, v21)
> (k2, v22)
> (k2, v23)
> ...
>
> I want to calculate the average and standard deviation of (v11, v12, v13)
> and (v21, v22, v23) grouped by their keys.
> For the moment, I have done that by using groupByKey and map. I notice
> that groupByKey is very expensive, but I cannot figure out how to do it
> with aggregateByKey, so I wonder if there is a better way to do this?
>
> Thanks!
>
> ------------------------------
> qinwei