How do I generate a histogram?

W.P. McNeill Mon, 09 May 2011 13:13:21 -0700

I have a set of (key, value) pairs. For each value there is a function
f(value) that returns an integer. I want to generate a histogram over
f(value) for my data set. For example, writing each value as [f(value)],
if I have the data set


key1, [3]
key2, [4]
key3, [3]
key4, [5]

I'd want to produce

3, 2
4, 1
5, 1

because f(value) = 3 appears twice in my data set while f(value) = 4 and
f(value) = 5 each appear once.
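
To make the desired output concrete, here is a minimal plain-Java sketch of
the counting itself. It is not a Hadoop job and does not use the Aggregator
framework; it only mirrors the idea of emitting (f(value), 1) for each record
and summing the ones per bin. The names HistogramSketch and histogram are
hypothetical, and f is stood in for by Integer::intValue applied to the
already-computed f(value) results from the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Hypothetical illustration class; not part of Hadoop.
class HistogramSketch {
    // Counts how many values fall into each bin f(value).
    // A TreeMap keeps the bins sorted by f(value).
    static <V> Map<Integer, Integer> histogram(Iterable<V> values, ToIntFunction<V> f) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (V value : values) {
            // "Map" step: emit (f(value), 1). "Reduce" step: sum the ones per bin.
            counts.merge(f.applyAsInt(value), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The four records from the example, represented by their f(value) results.
        List<Integer> values = Arrays.asList(3, 4, 3, 5);
        histogram(values, Integer::intValue)
                .forEach((bin, count) -> System.out.println(bin + ", " + count));
    }
}
```

Run on the four records above, this prints the lines 3, 2 then 4, 1 then 5, 1.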

I gather the right way to do this is to use the Aggregator framework, but I
can't understand the documentation. I've read the API docs for
ValueAggregatorDescriptor
<http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/ValueAggregatorDescriptor.html>
and related classes and looked at the Aggregate*.java files in the examples
directory, but it's still not making sense to me. (This may in part be due
to the fact that the examples are still for the old API while I'm working in
the new API, though I'm not sure.)

Can someone point me to clearer documentation online or in print, or provide
a simple example for my task?

Thanks.
