Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by stack: http://wiki.apache.org/lucene-hadoop/Hbase/MapReduce The comment on the change is: Edit

------------------------------------------------------------------------------

= Hbase, MapReduce and the CLASSPATH =

An hbase cluster configuration is an aggregation of the hbase particulars found in ''$HBASE_CONF_DIR'' -- default location is ''$HBASE_HOME/conf'' -- and the hadoop configuration in ''$HADOOP_CONF_DIR'', usually ''$HADOOP_HOME/conf''. When the hbase start/stop scripts run, they read the content of ''$HBASE_CONF_DIR'' and then that of ''$HADOOP_CONF_DIR''.

M!apReduce job jars deployed to a mapreduce cluster do not usually have access to ''$HBASE_CONF_DIR''. Any hbase-particular configuration not hard-coded into the job jar classes -- e.g. the address of the target hbase master -- needs to be either included explicitly in the job jar, by jarring an ''hbase-site.xml'' into a conf subdirectory, or added to an ''hbase-site.xml'' under ''$HADOOP_HOME/conf'' that is then copied across the mapreduce cluster.
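For example, an ''hbase-site.xml'' naming the target hbase master might look like the following sketch. The host and port value is illustrative; check the hbase-default.xml shipped with your hbase for the authoritative property names and defaults:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.master</name>
    <value>master.example.org:60000</value>
    <description>Host and port of the hbase master the job
    should talk to (illustrative value).</description>
  </property>
</configuration>
```

Jarred into a conf subdirectory of the job jar, or dropped into ''$HADOOP_HOME/conf'', this is picked up the same way any hadoop configuration resource is.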
The same holds true for any hbase classes referenced by the mapreduce job jar. By default the hbase classes are not available on the general mapreduce ''CLASSPATH''. To add them, you have a couple of options. Either include the hadoop-X.X.X-hbase.jar in the job jar under a lib subdirectory, or copy the hadoop-X.X.X-hbase.jar to ''$HADOOP_HOME/lib'' and replicate it across the cluster. But the cleanest means of adding hbase to the cluster ''CLASSPATH'' is to uncomment ''HADOOP_CLASSPATH'' in ''$HADOOP_HOME/conf/hadoop-env.sh'', add the path to the hbase jar -- usually ''$HADOOP_HOME/contrib/hbase/hadoop-X.X.X-hbase.jar'' -- and then copy the amended configuration across the cluster. You'll need to restart the mapreduce cluster if you want it to notice the new configuration.

= Hbase as MapReduce job data source and sink =

Hbase can be used as a data source, [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableInputFormat.html TableInputFormat], and as a data sink, [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableOutputFormat.html TableOutputFormat], for mapreduce jobs.
Writing mapreduce jobs that read or write hbase, you'll probably want to subclass [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableMap.html TableMap] and/or [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/TableReduce.html TableReduce]. See the do-nothing passthrough classes [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/IdentityTableMap.html IdentityTableMap] and [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/IdentityTableReduce.html IdentityTableReduce] for basic usage. For a more involved example, see [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/javadoc/org/apache/hadoop/hbase/mapred/BuildTableIndex.html BuildTableIndex] from the same package, or review the org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test.

Running mapreduce jobs that have hbase as a source or sink, you'll need to specify the source/sink table and column names in your job configuration.

Reading from hbase, the TableInputFormat asks hbase for the list of regions and makes a map-per-region. Writing, it's better to have many reducers so that the load is spread across the hbase cluster.
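Job setup can be sketched roughly as below. This is an illustrative, non-compilable outline, not a definitive recipe: the ''initJob'' helper names and their exact signatures are assumptions drawn from the javadoc linked above and may differ in your hbase version, and ''MyTableJob'', ''MyMap'', ''MyReduce'', the ''mytable'' table and the ''contents:'' column are all made-up names.

```java
// Illustrative sketch only -- the initJob helpers and their exact
// signatures are assumptions; consult the javadoc for your version.
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyTableJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(MyTableJob.class);
    job.setJobName("mytable-job");

    // Source: read the (hypothetical) 'contents:' column of table
    // 'mytable'; this wires in TableInputFormat and the map class.
    TableMap.initJob("mytable", "contents:", MyMap.class, job);

    // Sink: write back to 'mytable' via TableOutputFormat.
    TableReduce.initJob("mytable", MyReduce.class, job);

    // Writing, use many reducers so load spreads over the hbase
    // cluster; reading, TableInputFormat makes one map task per
    // table region regardless of this setting.
    job.setNumReduceTasks(16);

    JobClient.runJob(job);
  }
}
```

MyMap and MyReduce here stand for your own subclasses of TableMap and TableReduce; the IdentityTableMap and IdentityTableReduce classes linked above show the minimal shape such subclasses take.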