Using Hbase as data sink

Jim Twensky Mon, 22 Dec 2008 12:38:09 -0800

Hello,
I have an application similar to the word count example in the Hadoop
Map/Reduce tutorial. Instead of counting words, however, I count phrases.
That is, for a sentence like:


"How are you"

I emit the following phrases inside my mapper:

How 1
How are 1
How are you 1
are 1
are you 1
you 1

and then inside the reducer, I sum the counts for each key and write the
totals to an output file.
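
To make the phrase generation concrete, here is a minimal sketch in plain Java, without any Hadoop or HBase classes. The class and method names (PhraseCount, phrases, countPhrases) are made up for illustration and are not taken from my actual job; the sketch only assumes that words are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PhraseCount {

    // Returns every contiguous run of words in the sentence,
    // grouped by starting word (the "map" side of the job).
    static List<String> phrases(String sentence) {
        List<String> out = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(words[end]);
                out.add(phrase.toString());
            }
        }
        return out;
    }

    // Sums a count of 1 per emitted phrase across all sentences
    // (the "reduce" side of the job, done in memory here).
    static Map<String, Integer> countPhrases(List<String> sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String phrase : phrases(sentence)) {
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        for (String phrase : phrases("How are you")) {
            System.out.println(phrase + " 1");
        }
    }
}
```

A sentence of n words yields n(n+1)/2 phrases, so the number of intermediate pairs grows quadratically with sentence length.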

However, I want to load these (phrase, count) pairs into HBase instead of
storing them in a file. I've already written the code and it works, but I
have some concerns about its performance and I'm not sure this is the
right way to do it. Here is what my reducer looks like:

public class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        HBaseConfiguration conf = new HBaseConfiguration();
        HTable table = new HTable(conf, "phrases");

        String row = key.toString();
        BatchUpdate update = new BatchUpdate(row);
        update.put("counters:value", Bytes.toBytes(sum));
        table.commit(update);
    }
}

As you can see, my reducer has an output collector of type
<Text, IntWritable>, but I don't call the output collector at all. Instead, I
load the data into HBase via table.commit.
I also use NullOutputFormat to avoid getting any empty output files.

This code works and does what I want, but I'm not convinced that this is the
right way to do it. I went through the example code, such as
BuildTableIndex.java and the others, but all of them have reducers of
the following form:

reduce(ImmutableBytesWritable key, Iterator<RowResult> values,
      OutputCollector<ImmutableBytesWritable, LuceneDocumentWrapper> output,
      @SuppressWarnings("unused") Reporter reporter)

which is not how I get the intermediate (key, value) pairs from the mapper
into the reducer.

Can you give me some advice, and a few lines of sample source code, if there
is a way to load the data using an output collector? Basically, I'm confused
about how to specify the reducer's input parameters and which class to
subclass. I posted this to the mailing list because I couldn't find any more
examples anywhere else.

Thanks in advance,
Jim
