Hello Jonathan,

Thanks for the fast response. Yes, my question is on other methods to put
the same data layout into HBase from my map reduce jobs. I've seen
TableOutputFormat but I couldn't find any example usages of it.
Specifically, when we use FileOutputFormat, a file called part-00000 is
created automatically and the output is written into it.  However, when
writing data into a table, we need to specify the output table name at a
minimum. Also, I think I'll need to use RowResult and ImmutableBytesWritable
objects as input parameters to output.collect, but I don't know how to do
that exactly. That's why I was looking around for a few code examples, but I
couldn't find much. I'd be glad if you could direct me to some examples or
post here if you have any of your own.
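
For reference, here is roughly what I imagine a TableOutputFormat-based
job would look like, pieced together from the Javadoc. This is untested,
and I'm not sure the key/value types or the OUTPUT_TABLE wiring are exactly
right, so please correct me:

```java
// Untested sketch against the 0.18/0.19 mapred API. The table name
// "phrases" and column "counters:value" are just the ones from my job.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PhraseReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, ImmutableBytesWritable, BatchUpdate> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    byte[] row = Bytes.toBytes(key.toString());
    BatchUpdate update = new BatchUpdate(row);
    update.put("counters:value", Bytes.toBytes(sum));
    // TableOutputFormat would take care of committing the BatchUpdate,
    // so there is no HTable in the reducer at all.
    output.collect(new ImmutableBytesWritable(row), update);
  }

  // Job wiring in the driver, as I understand it (again, untested):
  //   JobConf job = new JobConf(PhraseCount.class);
  //   job.setOutputFormat(TableOutputFormat.class);
  //   job.set(TableOutputFormat.OUTPUT_TABLE, "phrases");
  //   job.setOutputKeyClass(ImmutableBytesWritable.class);
  //   job.setOutputValueClass(BatchUpdate.class);
}
```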

As for the HTable, you are perfectly right. I'll take it outside of the
reducer.
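
Something like this, I suppose, creating the HTable once per task in
configure() instead of once per reduce() call (an untested sketch of my
reducer below, reworked per your suggestion):

```java
// Untested sketch: the HTable moved out of reduce() into a member that is
// created once per task in configure() (0.18-era API, same table/column
// names as in my original code).
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private HTable table;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    try {
      // One HTable per task, not one per reduce() call.
      table = new HTable(new HBaseConfiguration(), "phrases");
    } catch (IOException e) {
      throw new RuntimeException("could not open table", e);
    }
  }

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    BatchUpdate update = new BatchUpdate(key.toString());
    update.put("counters:value", Bytes.toBytes(sum));
    table.commit(update);
  }
}
```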

Thanks again,
Jim

On Mon, Dec 22, 2008 at 2:59 PM, Jonathan Gray <[email protected]> wrote:

> Jim,
>
> This looks like a sane way to do what you want.  Is your question strictly
> on other methods to put the same data layout into HBase from the MR job, or
> also about the choice of structure?
>
> As far as how else to use HBase as a data sink, you can make use of
> TableOutputFormat.  In my experience, however, it has been faster to use
> the API directly, as you are doing right now.  That said, TOF can batch
> your BatchUpdates automatically.
>
>
> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapred/TableOutputFormat.html
>
> One thing that will really help with performance is to not create a new
> HTable in each reduce call.  There was a post in the past day or two
> regarding this.  Pull it out into a member of the class, initialize it in
> the task initialization, and just reuse the same one throughout each
> reducer task.
>
> JG
>
> > -----Original Message-----
> > From: Jim Twensky [mailto:[email protected]]
> > Sent: Monday, December 22, 2008 12:38 PM
> > To: [email protected]
> > Subject: Using Hbase as data sink
> >
> > Hello,
> > I have an application which is similar to the word count example given
> > in the Hadoop Map/Reduce tutorial. Instead of counting words, however,
> > I count phrases. That is, for a sentence like:
> >
> > "How are you"
> >
> > I emit the following phrases inside my mapper:
> >
> > How 1
> > How are 1
> > How are you 1
> > are 1
> > are you 1
> > you 1
> >
> > and then inside the reducer, I aggregate the same keys and send them to
> > an
> > output file.
> >
> > However, I want to load these (phrase, count) pairs into HBase instead
> > of storing them in a file. I've already written the code and it works,
> > but I have some concerns about its performance and I'm not sure if this
> > is the right way to do it. Here is how my reducer looks:
> >
> > public class Reduce extends MapReduceBase implements Reducer<Text,
> > IntWritable, Text, IntWritable> {
> >
> >     public void reduce(Text key, Iterator<IntWritable> values,
> > OutputCollector<Text, IntWritable> output, Reporter reporter) throws
> > IOException {
> >         int sum = 0;
> >         while (values.hasNext()) {
> >             sum += values.next().get();
> >         }
> >
> >         HBaseConfiguration conf = new HBaseConfiguration();
> >         HTable table = new HTable(conf, "phrases");
> >
> >         String row = key.toString();
> >         BatchUpdate update = new BatchUpdate(row);
> >         update.put("counters:value", Bytes.toBytes(sum));
> >         table.commit(update);
> >
> >     }
> > }
> >
> > Now as you can see, my reducer has an output collector of type
> > <Text, IntWritable> but I don't call the output collector at all.
> > Instead, I write the data to HBase via table.commit.
> > I also use NullOutputFormat to avoid getting any empty output files.
> >
> > This code works and does what I want but I'm not convinced that this is
> > the
> > right way to do it. I tried to go over the example codes like
> > BuildTableIndex.java and the others but all of them already had
> > reducers of
> > the following form:
> >
> > reduce(ImmutableBytesWritable key, Iterator<RowResult> values,
> >       OutputCollector<ImmutableBytesWritable, LuceneDocumentWrapper>
> > output,
> >       @SuppressWarnings("unused") Reporter reporter)
> >
> > which is not how I get the intermediate key/value pairs from the
> > mapper into the reducer.
> >
> > Can you give me some advice and a few lines of sample source code if
> > there is a way to load the data using an output collector? Basically,
> > I'm confused about how to specify the reducer input parameters and
> > which class to subclass. I posted this to the mailing list because I
> > couldn't find any more examples anywhere else.
> >
> > Thanks in advance,
> > Jim
>
>
