Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.

The following page has been changed by stack:
http://wiki.apache.org/lucene-hadoop/Hbase/MapReduce

The comment on the change is:
Edit

------------------------------------------------------------------------------
  = Hbase, MapReduce and the CLASSPATH =
  
- An hbase cluster configuration is made up of the hbase particulars found at 
''$HBASE_CONF_DIR'' -- default location is ''$HBASE_HOME/conf'' -- and the 
hadoop configuration in ''$HADOOP_CONF_DIR'', usually ''$HADOOP_HOME/conf''.  
When hbase start/stop scripts run, they will read ''$HBASE_CONF_DIR'' content 
and then that of ''$HADOOP_CONF_DIR''.
+ An hbase cluster configuration is an aggregation of the hbase particulars 
found at ''$HBASE_CONF_DIR'' -- default location is ''$HBASE_HOME/conf'' -- and 
the hadoop configuration in ''$HADOOP_CONF_DIR'', usually ''$HADOOP_HOME/conf''.  
When the hbase start/stop scripts run, they read the ''$HBASE_CONF_DIR'' content 
and then that of ''$HADOOP_CONF_DIR''.
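+ In job or client code, this aggregation is what you get when you instantiate 
the hbase configuration class, which layers ''hbase-default.xml'' and 
''hbase-site.xml'' found on the ''CLASSPATH'' on top of the hadoop 
configuration.  A minimal sketch -- the class and property names follow the 
hbase javadoc of this era, so verify against your version:
{{{
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowConf {
  public static void main(String[] args) {
    // Picks up hadoop-site.xml, hbase-default.xml and hbase-site.xml from
    // whatever conf directories happen to be on the CLASSPATH.
    HBaseConfiguration conf = new HBaseConfiguration();
    // 'hbase.master' is the property naming the master host:port.
    System.out.println(conf.get("hbase.master"));
  }
}
}}}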
  
- !MapReduce job jars deployed to a mapreduce cluster do not usually have 
access to ''$HBASE_CONF_DIR''.  Any hbase-particular configuration not 
hard-coded into the job jar classes -- e.g. the address of the target hbase 
master -- needs to be either included explicitly in the job jar, by adding an 
''hbase-site.xml'' under a conf subdirectory, or added to an hbase-site.xml 
under ''$HADOOP_HOME/conf'' and copied across the mapreduce cluster.
+ !MapReduce job jars deployed to a mapreduce cluster do not usually have 
access to ''$HBASE_CONF_DIR''.  Any hbase-particular configuration not 
hard-coded into the job jar classes -- e.g. the address of the target hbase 
master -- needs to be either included explicitly in the job jar, by jarring an 
''hbase-site.xml'' into a conf subdirectory, or added to an ''hbase-site.xml'' 
under ''$HADOOP_HOME/conf'' that is then copied across the mapreduce cluster.
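+ As a sketch of the hard-coded alternative, a job can set the master address 
directly on its configuration.  The ''hbase.master'' property name is the 
historical one and the address below is a placeholder:
{{{
import org.apache.hadoop.mapred.JobConf;

public class HardcodedMaster {
  public static JobConf createJobConf() {
    JobConf job = new JobConf(HardcodedMaster.class);
    // Placeholder address; 60000 was the default hbase master port.
    job.set("hbase.master", "master.example.com:60000");
    return job;
  }
}
}}}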
  
- The same holds true for any hbase classes referenced by the mapreduce job 
jar.  By default the hbase classes are not available on the general mapreduce 
''CLASSPATH''.  You have a couple of options: either include the 
hadoop-X.X.X-hbase.jar in the job jar under the lib subdirectory, or copy the 
hadoop-X.X.X-hbase.jar to $HADOOP_HOME/lib and replicate it across the cluster.  
But the cleanest means of adding hbase to the cluster CLASSPATH is by 
uncommenting ''HADOOP_CLASSPATH'' in ''$HADOOP_HOME/conf/hadoop-env.sh'', adding 
the path to the hbase jar -- usually 
''$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar'' -- and then copying the 
amended configuration across the cluster.  You'll need to restart the mapreduce 
cluster if you want it to notice the new configuration.
+ The same holds true for any hbase classes referenced by the mapreduce job 
jar.  By default the hbase classes are not available on the general mapreduce 
''CLASSPATH''.  To add them, you have a couple of options: either include the 
hadoop-X.X.X-hbase.jar in the job jar under the lib subdirectory, or copy the 
hadoop-X.X.X-hbase.jar to $HADOOP_HOME/lib and replicate it across the cluster.  
But the cleanest means of adding hbase to the cluster CLASSPATH is by 
uncommenting ''HADOOP_CLASSPATH'' in ''$HADOOP_HOME/conf/hadoop-env.sh'', adding 
the path to the hbase jar -- usually 
''$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar'' -- and then copying the 
amended configuration across the cluster.  You'll need to restart the mapreduce 
cluster if you want it to notice the new configuration.
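+ The amended ''hadoop-env.sh'' entry might look like the following; the jar 
path is the usual contrib location, with the version left as X.X.X just as 
above:
{{{
# In $HADOOP_HOME/conf/hadoop-env.sh: uncomment HADOOP_CLASSPATH and point
# it at the hbase jar.
export HADOOP_CLASSPATH=$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar
}}}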
  
  = Hbase as MapReduce job data source and sink =
  
+ Hbase can be used as a data source, 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableInputFormat.html TableInputFormat], 
and data sink, 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableOutputFormat.html TableOutputFormat], 
for mapreduce jobs.  When writing mapreduce jobs that read or write hbase, 
you'll probably want to subclass 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableMap.html TableMap] 
and/or 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableReduce.html TableReduce].  
See the do-nothing passthrough classes 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/IdentityTableMap.html IdentityTableMap] 
and 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/IdentityTableReduce.html IdentityTableReduce] 
for basic usage.  For a more involved example, see 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/BuildTableIndex.html BuildTableIndex] 
from the same package, or review the 
org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test.
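+ A minimal sketch of wiring the identity classes into a job that copies one 
table into another.  The table and column names are placeholders, and the 
initJob convenience methods follow the javadoc linked above -- their signatures 
have shifted between versions, so verify against your hbase jar:
{{{
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.IdentityTableReduce;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CopyTable {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(CopyTable.class);
    job.setJobName("copytable");
    // Scan all rows of the placeholder table 'sourcetable', reading the
    // placeholder column family 'contents:'; rows pass through unchanged.
    TableMap.initJob("sourcetable", "contents:", IdentityTableMap.class, job);
    // Write each reduced row into the placeholder table 'targettable'.
    TableReduce.initJob("targettable", IdentityTableReduce.class, job);
    JobClient.runJob(job);
  }
}
}}}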
- Hbase can be used as a data source and data sink for 
- http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableInputFormat.html
- http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableOutputFormat.html
- Good to have lots of reducers
  
- Need to add hbase lib?
+ When running mapreduce jobs that have hbase as a source or sink, you'll need 
to specify the source/sink table and column names in your job configuration, as 
the sketch above does via the initJob calls.
  
- Note on how the splits are done and the config. needed.
+ When reading from hbase, TableInputFormat asks hbase for the list of regions 
and makes a map task per region.  When writing, it's better to have lots of 
reducers so the load is spread across the hbase cluster.
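+ For example, the reducer count can be set on the job.  The sizing below -- a 
couple of reducers per region server -- is only a placeholder heuristic, not a 
recommendation from this page:
{{{
import org.apache.hadoop.mapred.JobConf;

public class ReducerSizing {
  // Placeholder heuristic: a small multiple of the region server count, so
  // writes fan out across the hbase cluster instead of funneling through
  // a single reducer.
  public static void sizeReducers(JobConf job, int regionServers) {
    job.setNumReduceTasks(regionServers * 2);
  }
}
}}}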
  
