Hello Friends:
I generated a Pair RDD with K/V pairs, like so:
>>>
>>> rdd1.take(10) # Show a small sample.
[(u'2013-10-09', 7.60117302052786),
(u'2013-10-10', 9.322709163346612),
(u'2013-10-10', 28.264462809917358),
(u'2013-10-07', 9.664429530201343),
(u'2013-10-07', 12.461538461538463),
(u'2013-10-09', 20.76923076923077),
(u'2013-10-08', 11.842105263157894),
(u'2013-10-13', 32.32514177693762),
(u'2013-10-13', 26.249999999999996),
(u'2013-10-13', 10.693069306930692)]
Now from the above RDD, I would like to calculate an average of the
VALUES for each KEY.
I can do so as shown here, which does work:
*
*>>>**countsByKey = sc.broadcast(rdd1.countByKey()) # SAMPLE OUTPUT of
countsByKey.value: {u'2013-09-09': 215, u'2013-09-08': 69, ... snip ...}
>>> rdd1 = rdd1.reduceByKey(operator.add) # Calculate the numerator
(i.e. the SUM).
>>> rdd1 = rdd1.map(lambda x: (x[0], x[1]/countsByKey.value[x[0]])) #
Divide each SUM by it's denominator (i.e. COUNT)
>>> print(rdd1.collect())
[(u'2013-10-09', 11.235365503035176),
(u'2013-10-07', 23.39500642456595),
... snip ...
]
But I wonder if the above semantics/approach is the optimal one; or
whether perhaps there is a single API call
that handles common use case.
Improvement thoughts welcome. =:)
Thank you,
nmv