Hello,
I have an application which is similar to the word count example given in
the Hadoop Map/Reduce tutorial. Instead of counting words, however, I
count phrases. That is, for a sentence like:
"How are you"
I emit the following phrases inside my mapper:
How 1
How are 1
How are you 1
are 1
are you 1
you 1
and then inside the reducer, I aggregate the same keys and send them to an
output file.
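For reference, the phrase emission described above can be sketched in plain Java, outside of Hadoop. This is only a minimal sketch, assuming words are separated by whitespace; the class name PhraseEmitter and the method phrases are hypothetical names used for illustration and are not taken from my actual mapper.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseEmitter {
    // Returns every contiguous run of words in the sentence,
    // ordered by start word and then by increasing length.
    // Assumption: words are separated by whitespace.
    static List<String> phrases(String sentence) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(words[end]);
                result.add(phrase.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In the real mapper each phrase would go to the output
        // collector with a count of 1 instead of being printed.
        for (String p : phrases("How are you")) {
            System.out.println(p + " 1");
        }
    }
}
```

A sentence of n words yields n(n+1)/2 phrases this way, so the number of intermediate pairs grows quadratically with sentence length.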
However, I want to load these (phrase, count) pairs into HBase instead of
storing them in a file. I've already written the code and it works, but I
have some concerns about its performance and I'm not sure if this is the
right way to do it. Here is how my reducer looks:
public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "phrases");
    String row = key.toString();
    BatchUpdate update = new BatchUpdate(row);
    update.put("counters:value", Bytes.toBytes(sum));
    table.commit(update);
  }
}
Now, as you can see, my reducer has an output collector of type
<Text,IntWritable>, but I don't call the output collector at all. Instead, I
load the data into HBase via table.commit.
I also use NullOutputFormat to avoid getting any empty output files.
This code works and does what I want, but I'm not convinced that this is the
right way to do it. I went over the example code, such as
BuildTableIndex.java and the others, but all of them already had reducers of
the following form:
reduce(ImmutableBytesWritable key, Iterator<RowResult> values,
OutputCollector<ImmutableBytesWritable, LuceneDocumentWrapper> output,
@SuppressWarnings("unused") Reporter reporter)
which is not how I get the intermediate key/value pairs from the mapper into
the reducer.
Can you give me some advice and a few lines of sample source code if there
is a way to load the data using an output collector? Basically, I'm confused
about how to specify the reducer input parameters and which class to subclass.
I posted this to the mailing list because I couldn't find any more examples
anywhere else.
Thanks in advance,
Jim