Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.

The following page has been changed by stack:
http://wiki.apache.org/lucene-hadoop/Hbase/MapReduce

The comment on the change is:
Edit

------------------------------------------------------------------------------
  = Hbase, MapReduce and the CLASSPATH =
  
- An hbase cluster configuration is made up of the hbase particulars found at 
''$HBASE_CONF_DIR'' -- default location is ''$HBASE_HOME/conf'' -- and the 
hadoop configuration in ''$HADOOP_CONF_DIR'', usually ''$HADOOP_HOME/conf''.  
When hbase start/stop scripts run, they will read ''$HBASE_CONF_DIR'' content 
and then that of ''$HADOOP_CONF_DIR''.
+ An hbase cluster configuration is an aggregation of the hbase particulars 
found at ''$HBASE_CONF_DIR'' -- default location is ''$HBASE_HOME/conf'' -- and 
the hadoop configuration in ''$HADOOP_CONF_DIR'', usually ''$HADOOP_HOME/conf''.  
When the hbase start/stop scripts run, they read the ''$HBASE_CONF_DIR'' content 
and then that of ''$HADOOP_CONF_DIR''.
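+ In job or client code, this aggregation is what you get when you instantiate 
the hbase configuration class, which layers ''hbase-default.xml'' and 
''hbase-site.xml'' found on the ''CLASSPATH'' on top of the hadoop 
configuration.  A minimal sketch -- the class and property names follow the 
hbase javadoc of this era, so verify against your version:
{{{
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowConf {
  public static void main(String[] args) {
    // Picks up hadoop-site.xml, hbase-default.xml and hbase-site.xml from
    // whatever conf directories happen to be on the CLASSPATH.
    HBaseConfiguration conf = new HBaseConfiguration();
    // 'hbase.master' is the property naming the master host:port.
    System.out.println(conf.get("hbase.master"));
  }
}
}}}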
  
- !MapReduce job jars deployed to a mapreduce cluster do not usually have 
access to ''$HBASE_CONF_DIR''.  Any hbase-particular configuration not 
hard-coded into the job jar classes -- e.g. the address of the target hbase 
master -- needs to be either included explicitly in the job jar, by adding an 
''hbase-site.xml'' under a conf subdirectory, or added to an hbase-site.xml 
under ''$HADOOP_HOME/conf'' and copied across the mapreduce cluster.
+ !MapReduce job jars deployed to a mapreduce cluster do not usually have 
access to ''$HBASE_CONF_DIR''.  Any hbase-particular configuration not 
hard-coded into the job jar classes -- e.g. the address of the target hbase 
master -- needs to be either included explicitly in the job jar, by jarring an 
''hbase-site.xml'' into a conf subdirectory, or added to an ''hbase-site.xml'' 
under ''$HADOOP_HOME/conf'' that is then copied across the mapreduce cluster.
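+ As a sketch of the hard-coded alternative, a job can set the master address 
directly on its configuration.  The ''hbase.master'' property name is the 
historical one and the address below is a placeholder:
{{{
import org.apache.hadoop.mapred.JobConf;

public class HardcodedMaster {
  public static JobConf createJobConf() {
    JobConf job = new JobConf(HardcodedMaster.class);
    // Placeholder address; 60000 was the default hbase master port.
    job.set("hbase.master", "master.example.com:60000");
    return job;
  }
}
}}}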
  
- The same holds true for any hbase classes referenced by the mapreduce job 
jar.  By default the hbase classes are not available on the general mapreduce 
''CLASSPATH''.  You have a couple of options: either include the 
hadoop-X.X.X-hbase.jar in the job jar under the lib subdirectory, or copy the 
hadoop-X.X.X-hbase.jar to $HADOOP_HOME/lib and replicate it across the cluster.  
But the cleanest means of adding hbase to the cluster CLASSPATH is by 
uncommenting ''HADOOP_CLASSPATH'' in ''$HADOOP_HOME/conf/hadoop-env.sh'', adding 
the path to the hbase jar -- usually 
''$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar'' -- and then copying the 
amended configuration across the cluster.  You'll need to restart the mapreduce 
cluster if you want it to notice the new configuration.
+ The same holds true for any hbase classes referenced by the mapreduce job 
jar.  By default the hbase classes are not available on the general mapreduce 
''CLASSPATH''.  To add them, you have a couple of options: either include the 
hadoop-X.X.X-hbase.jar in the job jar under the lib subdirectory, or copy the 
hadoop-X.X.X-hbase.jar to $HADOOP_HOME/lib and replicate it across the cluster.  
But the cleanest means of adding hbase to the cluster CLASSPATH is by 
uncommenting ''HADOOP_CLASSPATH'' in ''$HADOOP_HOME/conf/hadoop-env.sh'', adding 
the path to the hbase jar -- usually 
''$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar'' -- and then copying the 
amended configuration across the cluster.  You'll need to restart the mapreduce 
cluster if you want it to notice the new configuration.
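+ The amended ''hadoop-env.sh'' entry might look like the following; the jar 
path is the usual contrib location, with the version left as X.X.X just as 
above:
{{{
# In $HADOOP_HOME/conf/hadoop-env.sh: uncomment HADOOP_CLASSPATH and point
# it at the hbase jar.
export HADOOP_CLASSPATH=$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar
}}}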
  
  = Hbase as MapReduce job data source and sink =
  
+ Hbase can be used as a data source, 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableInputFormat.html TableInputFormat], 
and data sink, 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableOutputFormat.html TableOutputFormat], 
for mapreduce jobs.  When writing mapreduce jobs that read or write hbase, 
you'll probably want to subclass 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableMap.html TableMap] 
and/or 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableReduce.html TableReduce].  
See the do-nothing passthrough classes 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/IdentityTableMap.html IdentityTableMap] 
and 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/IdentityTableReduce.html IdentityTableReduce] 
for basic usage.  For a more involved example, see 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/BuildTableIndex.html BuildTableIndex] 
from the same package, or review the 
org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test.
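+ A minimal sketch of wiring the identity classes into a job that copies one 
table into another.  The table and column names are placeholders, and the 
initJob convenience methods follow the javadoc linked above -- their signatures 
have shifted between versions, so verify against your hbase jar:
{{{
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.IdentityTableReduce;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CopyTable {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(CopyTable.class);
    job.setJobName("copytable");
    // Scan all rows of the placeholder table 'sourcetable', reading the
    // placeholder column family 'contents:'; rows pass through unchanged.
    TableMap.initJob("sourcetable", "contents:", IdentityTableMap.class, job);
    // Write each reduced row into the placeholder table 'targettable'.
    TableReduce.initJob("targettable", IdentityTableReduce.class, job);
    JobClient.runJob(job);
  }
}
}}}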
- Hbase can be used as a data source and data sink for 
- http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableInputFormat.html
- http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableOutputFormat.html
- Good to have lots of reducers
  
- Need to add hbase lib?
+ When running mapreduce jobs that have hbase as a source or sink, you'll need 
to specify the source/sink table and column names in your job configuration, as 
the sketch above does via the initJob calls.
  
- Note on how the splits are done and the config. needed.
+ When reading from hbase, TableInputFormat asks hbase for the list of regions 
and makes a map task per region.  When writing, it's better to have lots of 
reducers so the load is spread across the hbase cluster.
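+ For example, the reducer count can be set on the job.  The sizing below -- a 
couple of reducers per region server -- is only a placeholder heuristic, not a 
recommendation from this page:
{{{
import org.apache.hadoop.mapred.JobConf;

public class ReducerSizing {
  // Placeholder heuristic: a small multiple of the region server count, so
  // writes fan out across the hbase cluster instead of funneling through
  // a single reducer.
  public static void sizeReducers(JobConf job, int regionServers) {
    job.setNumReduceTasks(regionServers * 2);
  }
}
}}}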
  
