Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

subscripti...@prismalytics.io Tue, 28 Apr 2015 07:01:57 -0700

Hello Friends:

I generated a Pair RDD with K/V pairs, like so:


>>> rdd1.take(10)  # Show a small sample.
 [(u'2013-10-09', 7.60117302052786),
 (u'2013-10-10', 9.322709163346612),
 (u'2013-10-10', 28.264462809917358),
 (u'2013-10-07', 9.664429530201343),
 (u'2013-10-07', 12.461538461538463),
 (u'2013-10-09', 20.76923076923077),
 (u'2013-10-08', 11.842105263157894),
 (u'2013-10-13', 32.32514177693762),
 (u'2013-10-13', 26.249999999999996),
 (u'2013-10-13', 10.693069306930692)]

Now, from the above RDD, I would like to calculate the average of the VALUES for each KEY.

I can do so as shown here, which does work:

>>> import operator
>>> countsByKey = sc.broadcast(rdd1.countByKey())  # SAMPLE OUTPUT of countsByKey.value: {u'2013-09-09': 215, u'2013-09-08': 69, ... snip ...}
>>> rdd1 = rdd1.reduceByKey(operator.add)  # Calculate the numerator (i.e. the SUM).
>>> rdd1 = rdd1.map(lambda x: (x[0], x[1]/countsByKey.value[x[0]]))  # Divide each SUM by its denominator (i.e. the COUNT).

>>> print(rdd1.collect())
  [(u'2013-10-09', 11.235365503035176),
   (u'2013-10-07', 23.39500642456595),
   ... snip ...
  ]

But I wonder whether the above approach is the optimal one, or whether there is perhaps a single API call that handles this common use case.
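
For comparison, a single-pass alternative might carry a running (sum, count) pair per key instead of broadcasting the counts. The sketch below is plain Python, not Spark: the helper names (seq_op, comb_op, fold_partition, merge_partitions, averages_by_key) are hypothetical, and the two list slices only stand in for partitions. The data is the ten-row sample shown above.

```python
# Sample pairs copied from the rdd1.take(10) output above.
pairs = [
    (u'2013-10-09', 7.60117302052786),
    (u'2013-10-10', 9.322709163346612),
    (u'2013-10-10', 28.264462809917358),
    (u'2013-10-07', 9.664429530201343),
    (u'2013-10-07', 12.461538461538463),
    (u'2013-10-09', 20.76923076923077),
    (u'2013-10-08', 11.842105263157894),
    (u'2013-10-13', 32.32514177693762),
    (u'2013-10-13', 26.249999999999996),
    (u'2013-10-13', 10.693069306930692),
]

ZERO = (0.0, 0)  # (running sum, running count)


def seq_op(acc, value):
    """Fold one value into a (sum, count) accumulator."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(a, b):
    """Merge two (sum, count) accumulators for the same key."""
    return (a[0] + b[0], a[1] + b[1])


def fold_partition(partition):
    """Build per-key (sum, count) pairs for one stand-in partition."""
    acc = {}
    for key, value in partition:
        acc[key] = seq_op(acc.get(key, ZERO), value)
    return acc


def merge_partitions(left, right):
    """Combine the per-key accumulators of two partitions."""
    merged = dict(left)
    for key, partial in right.items():
        merged[key] = comb_op(merged.get(key, ZERO), partial)
    return merged


def averages_by_key(partitions):
    """Return {key: mean of values} in one pass over each partition."""
    merged = {}
    for partition in partitions:
        merged = merge_partitions(merged, fold_partition(partition))
    return {key: total / count for key, (total, count) in merged.items()}


# Two slices stand in for two partitions of the data.
averages = averages_by_key([pairs[:5], pairs[5:]])
```

In PySpark, functions of this shape are what aggregateByKey accepts, so something like rdd1.aggregateByKey((0.0, 0), seq_op, comb_op).mapValues(lambda p: p[0] / p[1]) may serve the same purpose. That one-liner is untested here and is offered only as a direction to try.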

Improvement thoughts welcome. =:)

Thank you,
nmv
