Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by stack:
http://wiki.apache.org/hadoop/Hbase/MapReduce

The comment on the change is:
Add in sample uploading from out of Map Task (from Andrew Purtell)

------------------------------------------------------------------------------
  
  When running mapreduce jobs that have hbase as source or sink, you'll need to specify the source/sink table and column names in your job configuration.
  
+ Reading from hbase, the !TableInputFormat asks hbase for the list of regions and makes a map-per-region.  When writing, it may make sense to avoid the reduce step and write back into hbase from inside your map.  You'd do this when your job does not need the sort and collation that MR does in its reduce step; hbase sorts on insert anyway, so there is no point double-sorting (and shuffling data around your MR cluster) unless you need to.  If you do not need the reduce, you might just have your map emit counts of records processed so the framework can print that nice report of records processed when the job is done.  See the example code below.  If running the reduce step does make sense in your case, it's better to have lots of reducers so the load is spread across the hbase cluster.
+ 
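+ For the reduce case, the knob that matters is the reducer count.  A minimal sketch of the job setup (the job name and the count of 24 are made-up placeholders, not recommendations; tune to your cluster):
+ {{{
+ // Sketch only: spread reduce-side writes across the hbase cluster by
+ // running many reducers rather than funnelling everything through a few.
+ JobConf conf = new JobConf(new HBaseConfiguration());
+ conf.setJobName("myjob");      // placeholder name
+ conf.setNumReduceTasks(24);    // "lots of reducers"; pick a number that suits your cluster
+ }}}
+ 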
+ == Sample: running HBase inserts from the Map Task ==
+ Here's sample code from Andrew Purtell that does the HBase insert inside the mapper rather than via TableReduce.
+ {{{
+ // Imports for the HBase 0.2-era API; package locations may differ in
+ // other HBase versions.
+ import java.io.IOException;
+ import org.apache.hadoop.hbase.HBaseConfiguration;
+ import org.apache.hadoop.hbase.client.HTable;
+ import org.apache.hadoop.hbase.io.BatchUpdate;
+ import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+ import org.apache.hadoop.hbase.io.RowResult;
+ import org.apache.hadoop.hbase.mapred.TableMap;
+ import org.apache.hadoop.io.MapWritable;
+ import org.apache.hadoop.mapred.JobConf;
+ import org.apache.hadoop.mapred.OutputCollector;
+ import org.apache.hadoop.mapred.Reporter;
+ 
+ public class MyMap
+   extends TableMap<ImmutableBytesWritable,MapWritable> // or whatever
+ {
+   private HTable table;
+ 
+   public void configure(JobConf job) {
+     super.configure(job);
+     try {
+       HBaseConfiguration conf = new HBaseConfiguration(job);
+       table = new HTable(conf, "mytable");
+     } catch (Exception e) {
+       // Can't usefully report this from configure(); leave table null and
+       // report the failure from map() instead.
+     }
+   }
+ 
+   public void map(ImmutableBytesWritable key, RowResult value,
+     OutputCollector<ImmutableBytesWritable,MapWritable> output,
+     Reporter reporter) throws IOException
+   {
+     // now we can report an exception opening the table
+     if (table == null)
+       throw new IOException("could not open mytable");
+ 
+     // ...
+ 
+     // commit the result
+     BatchUpdate update = new BatchUpdate();
+     // ...
+     table.commit(update);
+   }
+ }
+ }}}
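+ Rather than emitting dummy records to get the end-of-job record counts, the map above could also tick a custom counter via the {{{Reporter}}} it is handed.  A minimal sketch (the {{{Counters}}} enum and its {{{ROWS_WRITTEN}}} member are made-up names, not part of Andrew's sample):
+ {{{
+ // Fragment to add to map() above.  Counters is a hypothetical enum
+ // (e.g. declared inside MyMap); any enum member works as a counter key.
+ enum Counters { ROWS_WRITTEN }
+ 
+ // ... after table.commit(update):
+ reporter.incrCounter(Counters.ROWS_WRITTEN, 1);
+ }}}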
+ The mapper sample above assumes that you set up your job like this: {{{JobConf conf = new JobConf(new HBaseConfiguration());}}}
+ 
+ Or maybe something like this:
+ 
+ {{{
+ JobConf conf = new JobConf(new Configuration());
+ conf.set("hbase.master", myMaster);
+ }}}
+ 
  
  = Sample MR+HBase Jobs =
  A 
[http://www.nabble.com/Re%3A-Map-Reduce-over-HBase---sample-code-p18253120.html 
students/classes example] by Naama Kraus.
