Repository: hbase Updated Branches: refs/heads/master d9f25e30a -> a1fe1e096
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/hbase_history.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/hbase_history.xml b/src/main/docbkx/hbase_history.xml new file mode 100644 index 0000000..f7b9064 --- /dev/null +++ b/src/main/docbkx/hbase_history.xml @@ -0,0 +1,41 @@ +<?xml version="1.0" encoding="UTF-8"?> +<appendix + xml:id="hbase.history" + version="5.0" + xmlns="http://docbook.org/ns/docbook" + xmlns:xlink="http://www.w3.org/1999/xlink" + xmlns:xi="http://www.w3.org/2001/XInclude" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:m="http://www.w3.org/1998/Math/MathML" + xmlns:html="http://www.w3.org/1999/xhtml" + xmlns:db="http://docbook.org/ns/docbook"> + <!--/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +--> + <title>HBase History</title> + <itemizedlist> + <listitem><para>2006: <link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper published by Google. + </para></listitem> + <listitem><para>2006 (end of year): HBase development starts. + </para></listitem> + <listitem><para>2008: HBase becomes Hadoop sub-project. + </para></listitem> + <listitem><para>2010: HBase becomes Apache top-level project. + </para></listitem> + </itemizedlist> +</appendix> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/hbck_in_depth.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/hbck_in_depth.xml b/src/main/docbkx/hbck_in_depth.xml new file mode 100644 index 0000000..e2ee34f --- /dev/null +++ b/src/main/docbkx/hbck_in_depth.xml @@ -0,0 +1,237 @@ +<?xml version="1.0" encoding="UTF-8"?> +<appendix + xml:id="hbck.in.depth" + version="5.0" + xmlns="http://docbook.org/ns/docbook" + xmlns:xlink="http://www.w3.org/1999/xlink" + xmlns:xi="http://www.w3.org/2001/XInclude" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:m="http://www.w3.org/1998/Math/MathML" + xmlns:html="http://www.w3.org/1999/xhtml" + xmlns:db="http://docbook.org/ns/docbook"> + <!--/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+ <title>hbck In Depth</title>
+ <para>HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems
+ and repairing a corrupted HBase. It works in two basic modes -- a read-only inconsistency
+ identifying mode and a multi-phase read-write repair mode.
+ </para>
+ <section>
+ <title>Running hbck to identify inconsistencies</title>
+ <para>To check whether your HBase cluster has corruptions, run hbck against your HBase cluster:</para>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck
+</programlisting>
+ <para>
+ At the end of the command's output it prints OK or tells you the number of INCONSISTENCIES
+ present. You may also want to run hbck a few times because some inconsistencies can be
+ transient (e.g. the cluster is starting up or a region is splitting). Operationally you may want to run
+ hbck regularly and set up alerting (e.g. via Nagios) if it repeatedly reports inconsistencies.
+ A run of hbck will report a list of inconsistencies along with a brief description of the regions and
+ tables affected. Using the <code>-details</code> option will report more details, including a representative
+ listing of all the splits present in all the tables.
+ </para>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck -details
+</programlisting>
+ <para>If you just want to know if some tables are corrupted, you can limit hbck to identify inconsistencies
+ in only specific tables. For example, the following command would only attempt to check tables
+ TableFoo and TableBar. The benefit is that hbck will run in less time.</para>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck TableFoo TableBar
+</programlisting>
+ </section>
+ <section><title>Inconsistencies</title>
+ <para>
+ If, after several runs, inconsistencies continue to be reported, you may have encountered a
+ corruption. These should be rare, but in the event they occur, newer versions of HBase include
+ the hbck tool enabled with automatic repair options.
+ </para>
+ <para>
+ There are two invariants that, when violated, create inconsistencies in HBase:
+ </para>
+ <itemizedlist>
+ <listitem><para>HBase's region consistency invariant is satisfied if every region is assigned and
+ deployed on exactly one region server, and all places where this state is kept are in
+ accordance.</para>
+ </listitem>
+ <listitem><para>HBase's table integrity invariant is satisfied if, for each table, every possible row key
+ resolves to exactly one region.</para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ Repairs generally work in three phases -- a read-only information gathering phase that identifies
+ inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then
+ finally a region consistency repair phase that restores the region consistency invariant.
+ Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset
+ of possible table integrity problems. It also included the ability to automatically fix the most
+ common inconsistency, region assignment and deployment consistency problems. This repair
+ could be done by using the <code>-fix</code> command line option. This fix closes regions if they are
+ open on the wrong server or on multiple region servers, and also assigns regions to region
+ servers if they are not open.
+ </para>
+ <para>
+ Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options were
+ introduced to aid in repairing a corrupted HBase. This hbck sometimes goes by the nickname
+ "uberhbck". Each particular version of uberhbck is compatible with HBase releases of the same
+ major version (for example, 0.90.7 uberhbck can repair a 0.90.4). However, versions <=0.90.6 and versions
+ <=0.92.1 may require restarting the master or failing over to a backup master.
+ </para>
+ </section>
+ <section><title>Localized repairs</title>
+ <para>
+ When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first.
+ These are generally region consistency repairs -- localized single-region repairs that only modify
+ in-memory data, ephemeral zookeeper data, or patch holes in the META table.
+ Region consistency requires that the HBase instance has the state of the region's data in HDFS
+ (.regioninfo files), the region's row in the hbase:meta table, and the region's deployment/assignments on
+ region servers and the master in accordance. Options for repairing region consistency include:
+ <itemizedlist>
+ <listitem><para><code>-fixAssignments</code> (equivalent to the 0.90 <code>-fix</code> option) repairs unassigned, incorrectly
+ assigned or multiply assigned regions.</para>
+ </listitem>
+ <listitem><para><code>-fixMeta</code>, which removes meta rows when corresponding regions are not present in
+ HDFS and adds new meta rows if the regions are present in HDFS but not in META.</para>
+ </listitem>
+ </itemizedlist>
+ To fix deployment and assignment problems, you can run this command:
+ </para>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck -fixAssignments
+</programlisting>
+ <para>To fix deployment and assignment problems as well as repair incorrect meta rows, you can
+ run this command:</para>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck -fixAssignments -fixMeta
+</programlisting>
+ <para>There are a few classes of table integrity problems that are low risk repairs. The first two are
+ degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are
+ automatically handled by sidelining the data to a temporary directory (/hbck/xxxx).
+ The third low-risk class is HDFS region holes. These can be repaired by using the:</para>
+ <itemizedlist>
+ <listitem><para><code>-fixHdfsHoles</code> option for fabricating new empty regions on the file system.
+ If holes are detected, you can use <code>-fixHdfsHoles</code> and should include <code>-fixMeta</code> and <code>-fixAssignments</code> to make the new regions consistent.</para>
+ </listitem>
+ </itemizedlist>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
+</programlisting>
+ <para>Since this is a common operation, we've added the <code>-repairHoles</code> flag that is equivalent to the
+ previous command:</para>
+ <programlisting language="bourne">
+$ ./bin/hbase hbck -repairHoles
+</programlisting>
+ <para>If inconsistencies still remain after these steps, you most likely have table integrity problems
+ related to orphaned or overlapping regions.</para>
+ </section>
+ <section><title>Region Overlap Repairs</title>
+ <para>Table integrity problems can require repairs that deal with overlaps. This is a riskier operation
+ because it requires modifications to the file system, requires some decision making, and may
+ require some manual steps.
For these repairs it is best to analyze the output of an <code>hbck -details</code>
+ run so that you isolate repair attempts to only the problems the checks identify. Because this is
+ riskier, there are safeguards that should be used to limit the scope of the repairs.
+ WARNING: These features are relatively new and have only been tested on online but idle HBase instances
+ (no reads/writes). Use at your own risk in an active production environment!
+ The options for repairing table integrity violations include:</para>
+ <itemizedlist>
+ <listitem><para><code>-fixHdfsOrphans</code> option for "adopting" a region directory that is missing a region
+ metadata file (the .regioninfo file).</para>
+ </listitem>
+ <listitem><para><code>-fixHdfsOverlaps</code> option for fixing overlapping regions.</para>
+ </listitem>
+ </itemizedlist>
+ <para>When repairing overlapping regions, a region's data can be modified on the file system in two
+ ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to
+ a "sideline" directory from which the data can be restored later. Merging a large number of regions is
+ technically correct but could result in an extremely large region that requires a series of costly
+ compactions and splitting operations. In these cases, it is probably better to sideline the regions
+ that overlap with the most other regions (likely the largest ranges) so that merges can happen on
+ a more reasonable scale. Since these sidelined regions are already laid out in HBase's native
+ directory and HFile format, they can be restored by using HBase's bulk load mechanism.
+ The default safeguard thresholds are conservative. These options let you override the default
+ thresholds and enable the large region sidelining feature.</para>
+ <itemizedlist>
+ <listitem><para><code>-maxMerge <n></code> maximum number of overlapping regions to merge</para>
+ </listitem>
+ <listitem><para><code>-sidelineBigOverlaps</code> if more than maxMerge regions are overlapping, attempt
+ to sideline the regions overlapping with the most other regions.</para>
+ </listitem>
+ <listitem><para><code>-maxOverlapsToSideline <n></code> if sidelining large overlapping regions, sideline at most n
+ regions.</para>
+ </listitem>
+ </itemizedlist>
+
+ <para>Since you often just want to get the tables repaired, you can use this option to turn
+ on all repair options:</para>
+ <itemizedlist>
+ <listitem><para><code>-repair</code> includes all the region consistency options and only the hole-repairing table
+ integrity options.</para>
+ </listitem>
+ </itemizedlist>
+ <para>Finally, there are safeguards to limit repairs to only specific tables. For example, the following
+ command would only attempt to check and repair tables TableFoo and TableBar.</para>
+ <screen language="bourne">
+$ ./bin/hbase hbck -repair TableFoo TableBar
+</screen>
+ <section><title>Special cases: Meta is not properly assigned</title>
+ <para>There are a few special cases that hbck can handle as well.
+ Sometimes the meta table's only region is inconsistently assigned or deployed. In this case
+ there is a special <code>-fixMetaOnly</code> option that can try to fix meta assignments.</para>
+ <screen language="bourne">
+$ ./bin/hbase hbck -fixMetaOnly -fixAssignments
+</screen>
+ </section>
+ <section><title>Special cases: HBase version file is missing</title>
+ <para>HBase's data on the file system requires a version file in order to start.
If this file is missing, you
+ can use the <code>-fixVersionFile</code> option to fabricate a new HBase version file. This assumes that
+ the version of hbck you are running is the appropriate version for the HBase cluster.</para>
+ </section>
+ <section><title>Special case: Root and META are corrupt.</title>
+ <para>The most drastic corruption scenario is the case where the ROOT or META is corrupted and
+ HBase will not start. In this case you can use the OfflineMetaRepair tool to create new ROOT
+ and META regions and tables.
+ This tool assumes that HBase is offline. It then marches through the existing HBase home
+ directory and loads as much information from region metadata files (.regioninfo files) as possible
+ from the file system. If the region metadata has proper table integrity, it sidelines the original root
+ and meta table directories, and builds new ones with pointers to the region directories and their
+ data.</para>
+ <screen language="bourne">
+$ ./bin/hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
+</screen>
+ <para>NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck
+ can complete.
+ If the tool succeeds, you should be able to start HBase and run online repairs if necessary.</para>
+ </section>
+ <section><title>Special cases: Offline split parent</title>
+ <para>
+ Once a region is split, the offline parent will be cleaned up automatically. Sometimes, daughter regions
+ are split again before their parents are cleaned up. HBase can clean up parents in the right order. However,
+ there could sometimes be lingering offline split parents. They are in META and in HDFS, but not deployed,
+ and HBase can't clean them up. In this case, you can use the <code>-fixSplitParents</code> option to reset
+ them in META to be online and not split. hbck can then merge them with other regions if the
+ fix-overlapping-regions option is used.
+ </para>
+ <para>
+ This option should not normally be used, and it is not in <code>-fixAll</code>.
+ </para>
+ </section>
+ </section>
+
+</appendix> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/mapreduce.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/mapreduce.xml b/src/main/docbkx/mapreduce.xml new file mode 100644 index 0000000..9e9e474 --- /dev/null +++ b/src/main/docbkx/mapreduce.xml @@ -0,0 +1,630 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter
+ xml:id="mapreduce"
+ version="5.0"
+ xmlns="http://docbook.org/ns/docbook"
+ xmlns:xlink="http://www.w3.org/1999/xlink"
+ xmlns:xi="http://www.w3.org/2001/XInclude"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns:m="http://www.w3.org/1998/Math/MathML"
+ xmlns:html="http://www.w3.org/1999/xhtml"
+ xmlns:db="http://docbook.org/ns/docbook">
+ <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+ <title>HBase and MapReduce</title>
+ <para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
+ the framework used most often with <link
+ xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
+ scope of this document. A good place to get started with MapReduce is <link
+ xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
+ 2 (MR2) is now part of <link
+ xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
+
+ <para> This chapter discusses specific configuration steps you need to take to use MapReduce on
+ data within HBase. In addition, it discusses other interactions and issues between HBase and
+ MapReduce jobs.
+ <note>
+ <title>mapred and mapreduce</title>
+ <para>There are two mapreduce packages in HBase, as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
+ and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former uses the old-style API and the latter
+ the new style. The latter has more facilities, though you can usually find an equivalent in the older
+ package. Pick the package that goes with your MapReduce deployment. When in doubt or starting over, pick
+ <filename>org.apache.hadoop.hbase.mapreduce</filename>. In the notes below, we refer to
+ o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
+ </para>
+ </note>
+ </para>
+
+ <section
+ xml:id="hbase.mapreduce.classpath">
+ <title>HBase, MapReduce, and the CLASSPATH</title>
+ <para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
+ the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
+ <para>To give the MapReduce jobs the access they need, you could add
+ <filename>hbase-site.xml</filename> to the
+ <filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
+ HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
+ directory, then copy these changes across your cluster. Alternatively, you could edit
+ <filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
+ the HBase JARs to the <envar>HADOOP_CLASSPATH</envar> variable. However, neither approach is
+ recommended because it will pollute your Hadoop install with HBase references. It also
+ requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
+ <para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
+ dependencies only need to be available on the local CLASSPATH. The following example runs
+ the bundled HBase <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
+ MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
+ the environment variables expected in the command (the parts prefixed by a
+ <literal>$</literal> sign and curly braces), you can use the actual system paths instead.
+ Be sure to use the correct version of the HBase JAR for your system.
The backticks
+ (<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
+ CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
+ <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable</userinput></screen>
+ <para>When the command runs, internally, the HBase JAR finds the dependencies it needs for
+ zookeeper, guava, and its other dependencies on the passed <envar>HADOOP_CLASSPATH</envar>
+ and adds the JARs to the MapReduce job configuration. See the source at
+ TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
+ <note>
+ <para> The example may not work if you are running HBase from its build directory rather
+ than an installed location. You may see an error like the following:</para>
+ <screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
+ <para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
+ from the <filename>target/</filename> directory within the build environment.</para>
+ <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable</userinput></screen>
+ </note>
+ <caution>
+ <title>Notice to MapReduce users of HBase 0.96.1 and above</title>
+ <para>Some MapReduce jobs that use HBase fail to launch. The symptom is an exception similar
+ to the following:</para>
+ <screen>
+Exception in thread "main" java.lang.IllegalAccessError: class
+ com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
+ com.google.protobuf.LiteralByteString
+ at java.lang.ClassLoader.defineClass1(Native Method)
+ at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
+ at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
+ at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
+ at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
+ at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
+ at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
+ at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
+ at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
+ at
+ org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
+ at
+ org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
+ at
+ org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
+ at
+ org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
+ at
+ org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
+ at
+ org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
+...
+</screen>
+ <para>This is caused by an optimization introduced in <link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
+ inadvertently introduced a classloader dependency.
</para> + <para>This affects both jobs using the <code>-libjars</code> option and "fat jar," those + which package their runtime dependencies in a nested <code>lib</code> folder.</para> + <para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be + included in Hadoop's classpath. See <xref + linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving + classpath errors. The following is included for historical purposes.</para> + <para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in + hadoop's lib directory, via a symlink or by copying the jar into the new location.</para> + <para>This can also be achieved on a per-job launch basis by including it in the + <code>HADOOP_CLASSPATH</code> environment variable at job submission time. When + launching jobs that package their dependencies, all three of the following job launching + commands satisfy this requirement:</para> + <screen language="bourne"> +$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput> +$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput> +$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput> + </screen> + <para>For jars that do not package their dependencies, the following command structure is + necessary:</para> + <screen language="bourne"> +$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ... + </screen> + <para>See also <link + xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for + further discussion of this issue.</para> + </caution> + </section> + + <section> + <title>MapReduce Scan Caching</title> + <para>TableMapReduceUtil now restores the option to set scanner caching (the number of rows + which are cached before returning the result to the client) on the Scan object that is + passed in. This functionality was lost due to a bug in HBase 0.95 (<link + xlink:href="https://issues.apache.org/jira/browse/HBASE-11558">HBASE-11558</link>), which + is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is + as follows:</para> + <orderedlist> + <listitem> + <para>Caching settings which are set on the scan object.</para> + </listitem> + <listitem> + <para>Caching settings which are specified via the configuration option + <option>hbase.client.scanner.caching</option>, which can either be set manually in + <filename>hbase-site.xml</filename> or via the helper method + <code>TableMapReduceUtil.setScannerCaching()</code>.</para> + </listitem> + <listitem> + <para>The default value <code>HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING</code>, which is set to + <literal>100</literal>.</para> + </listitem> + </orderedlist> + <para>Optimizing the caching settings is a balance between the time the client waits for a + result and the number of sets of results the client needs to receive. If the caching setting + is too large, the client could end up waiting for a long time or the request could even time + out. If the setting is too small, the scan needs to return results in several pieces. 
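+ As a concrete sketch (the driver class and table name here are hypothetical), caching can be set
+ directly on the <classname>Scan</classname> passed to <classname>TableMapReduceUtil</classname>,
+ which takes the highest precedence:
+ <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config, "ExampleScanCaching");
+job.setJarByClass(MyCachingJob.class); // hypothetical driver class
+
+Scan scan = new Scan();
+scan.setCaching(500); // overrides hbase.client.scanner.caching and the default of 100
+scan.setCacheBlocks(false); // don't set to true for MR jobs
+
+TableMapReduceUtil.initTableMapperJob(
+ "myTable", // hypothetical input table
+ scan, // Scan instance carrying the caching setting
+ MyMapper.class, // mapper
+ null, // mapper output key
+ null, // mapper output value
+ job);
+</programlisting>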
+ If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger
+ shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the
+ bucket.</para>
+ <para>The list of priorities mentioned above allows you to set a reasonable default, and
+ override it for specific operations.</para>
+ <para>See the API documentation for <link
+ xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
+ >Scan</link> for more details.</para>
+ </section>
+
+ <section>
+ <title>Bundled HBase MapReduce Jobs</title>
+ <para>The HBase JAR also serves as a Driver for some bundled MapReduce jobs. To learn about
+ the bundled MapReduce jobs, run the following command.</para>
+
+ <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar</userinput>
+<computeroutput>An example program must be given as the first argument.
+Valid program names are:
+ copytable: Export a table from local cluster to peer cluster
+ completebulkload: Complete a bulk data load.
+ export: Write table data to HDFS.
+ import: Import data written by Export.
+ importtsv: Import data in TSV format.
+ rowcounter: Count rows in HBase table</computeroutput>
+ </screen>
+ <para>Each of the valid program names is a bundled MapReduce job. To run one of the jobs,
+ model your command after the following example.</para>
+ <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable</userinput></screen>
+ </section>
+
+ <section>
+ <title>HBase as a MapReduce Job Data Source and Data Sink</title>
+ <para>HBase can be used as a data source, <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
+ and data sink, <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
+ or <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
+ for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to
+ subclass <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
+ and/or <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
+ See the do-nothing pass-through classes <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
+ and <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
+ for basic usage. For a more involved example, see <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
+ or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
+ <para>If you run MapReduce jobs that use HBase as source or sink, you need to specify the source and
+ sink table and column names in your configuration.</para>
+
+ <para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
+ from HBase and creates either one map task per region or
+ <code>mapreduce.job.maps</code> map tasks, whichever is smaller.
If your job only has two maps,
+ raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
+ will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
+ node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
+ HBase from within your map. This approach works when your job does not need the sort and
+ collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
+ no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
+ to. If you do not need the Reduce, your map might emit counts of records processed for
+ reporting at the end of the job, or set the number of Reduces to zero and use
+ TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
+ use multiple reducers so that load is spread across the HBase cluster.</para>
+
+ <para>A new HBase partitioner, the <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
+ can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
+ when your table is large and your upload will not greatly alter the number of existing
+ regions upon completion. Otherwise use the default partitioner. </para>
+ </section>
+
+ <section>
+ <title>Writing HFiles Directly During Bulk Import</title>
+ <para>If you are importing into a new table, you can bypass the HBase API and write your
+ content directly to the filesystem, formatted into HBase data files (HFiles). Your import
+ will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
+ see <xref
+ linkend="arch.bulk.load" />.</para>
+ </section>
+
+ <section>
+ <title>RowCounter Example</title>
+ <para>The included <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
+ MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
+ table. To run it, use the following command: </para>
+ <screen language="bourne">$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen>
+ <para>This will
+ invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
+ offered. This will print rowcounter usage advice to standard output. Specify the table name,
+ column to count, and output
+ directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
+ </section>
+
+ <section
+ xml:id="splitter">
+ <title>Map-Task Splitting</title>
+ <section
+ xml:id="splitter.default">
+ <title>The Default HBase MapReduce Splitter</title>
+ <para>When <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
+ is used to source an HBase table in a MapReduce job, its splitter will make a map task for
+ each region of the table. Thus, if there are 100 regions in the table, there will be 100
+ map-tasks for the job - regardless of how many column families are selected in the
+ Scan.</para>
+ </section>
+ <section
+ xml:id="splitter.custom">
+ <title>Custom Splitters</title>
+ <para>For those interested in implementing custom splitters, see the method
+ <code>getSplits</code> in <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
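+ As a rough sketch (the subclass and any filtering logic are hypothetical, and imports are omitted
+ as in the other examples), a custom splitter can extend <classname>TableInputFormat</classname>
+ and post-process the default one-split-per-region list:
+ <programlisting language="java">
+public class MyCustomTableInputFormat extends TableInputFormat {
+
+ @Override
+ public List<InputSplit> getSplits(JobContext context) throws IOException {
+ // Start from the default behaviour: one split per region.
+ List<InputSplit> defaultSplits = super.getSplits(context);
+ List<InputSplit> customSplits = new ArrayList<InputSplit>();
+ for (InputSplit split : defaultSplits) {
+ // Hypothetical filtering or combining logic would go here;
+ // this sketch keeps every split unchanged.
+ customSplits.add(split);
+ }
+ return customSplits;
+ }
+}
+</programlisting>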
+ The <code>getSplits</code> method is where the logic for map-task assignment resides. </para>
+ </section>
+ </section>
+ <section
+ xml:id="mapreduce.example">
+ <title>HBase MapReduce Examples</title>
+ <section
+ xml:id="mapreduce.example.read">
+ <title>HBase MapReduce Read Example</title>
+ <para>The following is an example of using HBase as a MapReduce source in a read-only manner.
+ Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
+ the Mapper. The job would be defined as follows...</para>
+ <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config, "ExampleRead");
+job.setJarByClass(MyReadJob.class); // class that contains mapper
+
+Scan scan = new Scan();
+scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false); // don't set to true for MR jobs
+// set other scan attrs
+...
+
+TableMapReduceUtil.initTableMapperJob(
+ tableName, // input HBase table name
+ scan, // Scan instance to control CF and attribute selection
+ MyMapper.class, // mapper
+ null, // mapper output key
+ null, // mapper output value
+ job);
+job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+ throw new IOException("error with job!");
+}
+ </programlisting>
+ <para>...and the mapper instance would extend <link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
+ <programlisting language="java">
+public static class MyMapper extends TableMapper<Text, Text> {
+
+ public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
+ // process data for the row from the Result instance.
+ }
+}
+ </programlisting>
+ </section>
+ <section
+ xml:id="mapreduce.example.readwrite">
+ <title>HBase MapReduce Read/Write Example</title>
+ <para>The following is an example of using HBase both as a source and as a sink with
+ MapReduce. This example will simply copy data from one table to another.</para>
+ <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleReadWrite");
+job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
+
+Scan scan = new Scan();
+scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false); // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+ sourceTable, // input table
+ scan, // Scan instance to control CF and attribute selection
+ MyMapper.class, // mapper class
+ null, // mapper output key
+ null, // mapper output value
+ job);
+TableMapReduceUtil.initTableReducerJob(
+ targetTable, // output table
+ null, // reducer class
+ job);
+job.setNumReduceTasks(0);
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+ throw new IOException("error with job!");
+}
+ </programlisting>
+ <para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
+ especially with the reducer.
<link
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
+ is being used as the outputFormat class, and several parameters are being set on the
+ config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
+ to <classname>ImmutableBytesWritable</classname> and reducer value to
+ <classname>Writable</classname>. These could be set by the programmer on the job and
+ conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
+ <para>The following is the example mapper, which will create a <classname>Put</classname>
+ matching the input <classname>Result</classname> and emit it. Note: this is what the
+ CopyTable utility does. </para>
+ <programlisting language="java">
+public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
+
+ public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+ // this example is just copying the data from the source table...
+ context.write(row, resultToPut(row,value));
+ }
+
+ private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
+ Put put = new Put(key.get());
+ for (KeyValue kv : result.raw()) {
+ put.add(kv);
+ }
+ return put;
+ }
+}
+ </programlisting>
+ <para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
+ care of sending the <classname>Put</classname> to the target table. </para>
+ <para>This is just an example; developers could choose not to use
+ <classname>TableOutputFormat</classname> and connect to the target table themselves.
+ </para>
+ </section>
+ <section
+ xml:id="mapreduce.example.readwrite.multi">
+ <title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
+ <para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
+ </section>
+ <section
+ xml:id="mapreduce.example.summary">
+ <title>HBase MapReduce Summary to HBase Example</title>
+ <para>The following example uses HBase as a MapReduce source and sink with a summarization
+ step. This example will count the number of distinct instances of a value in a table and
+ write those summarized counts to another table.
+ <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleSummary");
+job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
+
+Scan scan = new Scan();
+scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false); // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+ sourceTable, // input table
+ scan, // Scan instance to control CF and attribute selection
+ MyMapper.class, // mapper class
+ Text.class, // mapper output key
+ IntWritable.class, // mapper output value
+ job);
+TableMapReduceUtil.initTableReducerJob(
+ targetTable, // output table
+ MyTableReducer.class, // reducer class
+ job);
+job.setNumReduceTasks(1); // at least one, adjust as required
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+ throw new IOException("error with job!");
+}
+ </programlisting>
+ In this example mapper, a column with a String value is chosen as the value to summarize
+ upon. This value is used as the key to emit from the mapper, and an
+ <classname>IntWritable</classname> represents an instance counter.
+ <programlisting language="java">
+public static class MyMapper extends TableMapper<Text, IntWritable> {
+ public static final byte[] CF = "cf".getBytes();
+ public static final byte[] ATTR1 = "attr1".getBytes();
+
+ private final IntWritable ONE = new IntWritable(1);
+ private Text text = new Text();
+
+ public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+ String val = new String(value.getValue(CF, ATTR1));
+ text.set(val); // we can only emit Writables...
+
+ context.write(text, ONE);
+ }
+}
+ </programlisting>
+ In the reducer, the "ones" are counted (just like any other MR example that does this),
+ and then a <classname>Put</classname> is emitted.
+ <programlisting language="java">
+public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
+ public static final byte[] CF = "cf".getBytes();
+ public static final byte[] COUNT = "count".getBytes();
+
+ public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
+ int i = 0;
+ for (IntWritable val : values) {
+ i += val.get();
+ }
+ Put put = new Put(Bytes.toBytes(key.toString()));
+ put.add(CF, COUNT, Bytes.toBytes(i));
+
+ context.write(null, put);
+ }
+}
+ </programlisting>
+ </para>
+ </section>
+ <section
+ xml:id="mapreduce.example.summary.file">
+ <title>HBase MapReduce Summary to File Example</title>
+ <para>This is very similar to the summary example above, with the exception that it uses
+ HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
+ in the reducer. The mapper remains the same. </para>
+ <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleSummaryToFile");
+job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
+
+Scan scan = new Scan();
+scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false); // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+ sourceTable, // input table
+ scan, // Scan instance to control CF and attribute selection
+ MyMapper.class, // mapper class
+ Text.class, // mapper output key
+ IntWritable.class, // mapper output value
+ job);
+job.setReducerClass(MyReducer.class); // reducer class
+job.setNumReduceTasks(1); // at least one, adjust as required
+FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+ throw new IOException("error with job!");
+}
+ </programlisting>
+ <para>As stated above, the previous Mapper can run unchanged with this example. As for the
+ Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting
+ Puts.</para>
+ <programlisting language="java">
+ public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
+
+ public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
+ int i = 0;
+ for (IntWritable val : values) {
+ i += val.get();
+ }
+ context.write(key, new IntWritable(i));
+ }
+}
+ </programlisting>
+ </section>
+ <section
+ xml:id="mapreduce.example.summary.noreducer">
+ <title>HBase MapReduce Summary to HBase Without Reducer</title>
+ <para>It is also possible to perform summaries without a reducer - if you use HBase as the
+ reducer.
</para>
+ <para>An HBase target table would need to exist for the job summary. The Table method
+ <code>incrementColumnValue</code> would be used to atomically increment values. From a
+ performance perspective, it might make sense to keep a Map of values with their counts to
+ be incremented for each map-task, and make one update per key during the <code>
+ cleanup</code> method of the mapper. However, your mileage may vary depending on the
+ number of rows to be processed and the number of unique keys. </para>
+ <para>In the end, the summary results are in HBase. </para>
+ </section>
+ <section
+ xml:id="mapreduce.example.summary.rdbms">
+ <title>HBase MapReduce Summary to RDBMS</title>
+ <para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
+ it is possible to generate summaries directly to an RDBMS via a custom reducer. The
+ <code>setup</code> method can connect to an RDBMS (the connection information can be
+ passed via custom parameters in the context) and the cleanup method can close the
+ connection. </para>
+ <para>It is critical to understand that the number of reducers for the job affects the
+ summarization implementation, and you'll have to design this into your reducer.
+ Specifically, whether it is designed to run as a singleton (one reducer) or as multiple
+ reducers. Neither is right or wrong; it depends on your use case. Recognize that the more
+ reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
+ be created - this will scale, but only to a point. </para>
+ <programlisting language="java">
+ public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
+
+ private Connection c = null;
+
+ public void setup(Context context) {
+ // create DB connection...
+ }
+
+ public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
+ // do summarization
+ // in this example the keys are Text, but this is just an example
+ }
+
+ public void cleanup(Context context) {
+ // close db connection
+ }
+
+}
+ </programlisting>
+ <para>In the end, the summary results are written to your RDBMS table(s). </para>
+ </section>
+
+ </section>
+ <!-- mr examples -->
+ <section
+ xml:id="mapreduce.htable.access">
+ <title>Accessing Other HBase Tables in a MapReduce Job</title>
+ <para>Although the framework currently allows one HBase table as input to a MapReduce job,
+ other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
+ a Table instance in the setup method of the Mapper.
+ <programlisting language="java">public class MyMapper extends TableMapper<Text, LongWritable> {
+ private Connection connection;
+ private Table myOtherTable;
+
+ public void setup(Context context) throws IOException {
+ // Create a Connection to the cluster and save it, or use the Connection
+ // provided by an existing table.
+ connection = ConnectionFactory.createConnection(context.getConfiguration());
+ myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
+ }
+
+ public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+ // process Result...
+ // use 'myOtherTable' for lookups
+ }
+}
+ </programlisting>
+ </para>
+ </section>
+ <section
+ xml:id="mapreduce.specex">
+ <title>Speculative Execution</title>
+ <para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
+ HBase as a source. This can either be done on a per-Job basis through properties, or on the
+ entire cluster. Especially for longer running jobs, speculative execution will create
+ duplicate map-tasks which will double-write your data to HBase; this is probably not what
+ you want. </para>
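+ <para>As a minimal sketch (the property names shown are the MapReduce 2 ones, and the job name is
+ hypothetical; adjust for your Hadoop version), speculative execution can be disabled for a single
+ job through its configuration:</para>
+ <programlisting language="java">
+Configuration conf = HBaseConfiguration.create();
+conf.setBoolean("mapreduce.map.speculative", false); // no speculative map tasks
+conf.setBoolean("mapreduce.reduce.speculative", false); // no speculative reduce tasks
+Job job = new Job(conf, "MyHBaseSourcedJob"); // hypothetical job name
+</programlisting>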
+ <para>See <xref
+ linkend="spec.ex" /> for more information. </para>
+ </section>
+
+</chapter> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/orca.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/orca.xml b/src/main/docbkx/orca.xml new file mode 100644 index 0000000..29d8727 --- /dev/null +++ b/src/main/docbkx/orca.xml @@ -0,0 +1,47 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+ xml:id="orca"
+ version="5.0"
+ xmlns="http://docbook.org/ns/docbook"
+ xmlns:xlink="http://www.w3.org/1999/xlink"
+ xmlns:xi="http://www.w3.org/2001/XInclude"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns:m="http://www.w3.org/1998/Math/MathML"
+ xmlns:html="http://www.w3.org/1999/xhtml"
+ xmlns:db="http://docbook.org/ns/docbook">
+ <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+ <title>Apache HBase Orca</title>
+ <figure>
+ <title>Apache HBase Orca</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata align="center" valign="right"
+ fileref="jumping-orca_rotated_25percent.png"/>
+ </imageobject>
+ </mediaobject>
+ </figure>
+ <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-4920">An Orca is the Apache
+ HBase mascot.</link>
+ See NOTICES.txt. We got our Orca logo from http://www.vectorfree.com/jumping-orca.
+ It is licensed under Creative Commons Attribution 3.0; see https://creativecommons.org/licenses/by/3.0/us/.
+ We changed the logo by stripping the colored background, inverting
+ it, and then rotating it slightly.
+ </para>
+</appendix> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/other_info.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/other_info.xml b/src/main/docbkx/other_info.xml new file mode 100644 index 0000000..72ff274 --- /dev/null +++ b/src/main/docbkx/other_info.xml @@ -0,0 +1,83 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+ xml:id="other.info"
+ version="5.0"
+ xmlns="http://docbook.org/ns/docbook"
+ xmlns:xlink="http://www.w3.org/1999/xlink"
+ xmlns:xi="http://www.w3.org/2001/XInclude"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns:m="http://www.w3.org/1998/Math/MathML"
+ xmlns:html="http://www.w3.org/1999/xhtml"
+ xmlns:db="http://docbook.org/ns/docbook">
+ <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +--> + <title>Other Information About HBase</title> + <section xml:id="other.info.videos"><title>HBase Videos</title> + <para>Introduction to HBase + <itemizedlist> + <listitem><para><link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/presentation/chicago_data_summit_apache_hbase_an_introduction_todd_lipcon.html">Introduction to HBase</link> by Todd Lipcon (Chicago Data Summit 2011). + </para></listitem> + <listitem><para><link xlink:href="http://www.cloudera.com/videos/intorduction-hbase-todd-lipcon">Introduction to HBase</link> by Todd Lipcon (2010). + </para></listitem> + </itemizedlist> + </para> + <para><link xlink:href="http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-building-realtime-big-data-services-at-facebook-with-hadoop-and-hbase">Building Real Time Services at Facebook with HBase</link> by Jonathan Gray (Hadoop World 2011). + </para> + <para><link xlink:href="http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop">HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon</link> by JD Cryans (Hadoop World 2010). + </para> + </section> + <section xml:id="other.info.pres"><title>HBase Presentations (Slides)</title> + <para><link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-advanced-hbase-schema-design.html">Advanced HBase Schema Design</link> by Lars George (Hadoop World 2011). + </para> + <para><link xlink:href="http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction">Introduction to HBase</link> by Todd Lipcon (Chicago Data Summit 2011). + </para> + <para><link xlink:href="http://www.slideshare.net/cloudera/hw09-practical-h-base-getting-the-most-from-your-h-base-install">Getting The Most From Your HBase Install</link> by Ryan Rawson, Jonathan Gray (Hadoop World 2009). + </para> + </section> + <section xml:id="other.info.papers"><title>HBase Papers</title> + <para><link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> by Google (2006). + </para> + <para><link xlink:href="http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html">HBase and HDFS Locality</link> by Lars George (2010). + </para> + <para><link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link> by Ian Varley (2009). + </para> + </section> + <section xml:id="other.info.sites"><title>HBase Sites</title> + <para><link xlink:href="http://www.cloudera.com/blog/category/hbase/">Cloudera's HBase Blog</link> has a lot of links to useful HBase information. 
+ <itemizedlist> + <listitem><para><link xlink:href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/">CAP Confusion</link> is a relevant entry for background information on + distributed storage systems.</para> + </listitem> + </itemizedlist> + </para> + <para><link xlink:href="http://wiki.apache.org/hadoop/HBase/HBasePresentations">HBase Wiki</link> has a page with a number of presentations. + </para> + <para><link xlink:href="http://refcardz.dzone.com/refcardz/hbase">HBase RefCard</link> from DZone. + </para> + </section> + <section xml:id="other.info.books"><title>HBase Books</title> + <para><link xlink:href="http://shop.oreilly.com/product/0636920014348.do">HBase: The Definitive Guide</link> by Lars George. + </para> + </section> + <section xml:id="other.info.books.hadoop"><title>Hadoop Books</title> + <para><link xlink:href="http://shop.oreilly.com/product/9780596521981.do">Hadoop: The Definitive Guide</link> by Tom White. + </para> + </section> + +</appendix> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/performance.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/performance.xml b/src/main/docbkx/performance.xml index 1757d3f..42ed79b 100644 --- a/src/main/docbkx/performance.xml +++ b/src/main/docbkx/performance.xml @@ -273,7 +273,7 @@ tableDesc.addFamily(cfDesc); If there is enough RAM, increasing this can help. </para> </section> - <section xml:id="hbase.regionserver.checksum.verify"> + <section xml:id="hbase.regionserver.checksum.verify.performance"> <title><varname>hbase.regionserver.checksum.verify</varname></title> <para>Have HBase write the checksum into the datablock and save having to do the checksum seek whenever you read.</para> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/sql.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/sql.xml b/src/main/docbkx/sql.xml new file mode 100644 index 0000000..40f43d6 --- /dev/null +++ b/src/main/docbkx/sql.xml @@ -0,0 +1,40 @@ +<?xml version="1.0" encoding="UTF-8"?> +<appendix + xml:id="sql" + version="5.0" + xmlns="http://docbook.org/ns/docbook" + xmlns:xlink="http://www.w3.org/1999/xlink" + xmlns:xi="http://www.w3.org/2001/XInclude" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:m="http://www.w3.org/1998/Math/MathML" + xmlns:html="http://www.w3.org/1999/xhtml" + xmlns:db="http://docbook.org/ns/docbook"> + <!--/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +--> + <title>SQL over HBase</title> + <section xml:id="phoenix"> + <title>Apache Phoenix</title> + <para><link xlink:href="http://phoenix.apache.org">Apache Phoenix</link></para> + </section> + <section xml:id="trafodion"> + <title>Trafodion</title> + <para><link xlink:href="https://wiki.trafodion.org/">Trafodion: Transactional SQL-on-HBase</link></para> + </section> + +</appendix> http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/upgrading.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/upgrading.xml b/src/main/docbkx/upgrading.xml index d5708a4..5d71e0f 100644 --- a/src/main/docbkx/upgrading.xml +++ b/src/main/docbkx/upgrading.xml @@ -240,7 +240,7 @@ </table> </section> - <section xml:id="hbase.client.api"> + <section xml:id="hbase.client.api.surface"> <title>HBase API surface</title> <para> HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses a version of <link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html">Hadoop's Interface classification</link>. HBase's Interface classification classes can be found <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/classification/package-summary.html"> here</link>. http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/ycsb.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/ycsb.xml b/src/main/docbkx/ycsb.xml new file mode 100644 index 0000000..695614c --- /dev/null +++ b/src/main/docbkx/ycsb.xml @@ -0,0 +1,36 @@ +<?xml version="1.0" encoding="UTF-8"?> +<appendix xml:id="ycsb" version="5.0" xmlns="http://docbook.org/ns/docbook" + xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xi="http://www.w3.org/2001/XInclude" + xmlns:svg="http://www.w3.org/2000/svg" xmlns:m="http://www.w3.org/1998/Math/MathML" + xmlns:html="http://www.w3.org/1999/xhtml" xmlns:db="http://docbook.org/ns/docbook"> + <!--/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +--> + <title>YCSB</title> + <para><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The + Yahoo! Cloud Serving Benchmark</link> and HBase</para> + <para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para> + <para>TODO: Describe setup of YCSB for HBase. In particular, presplit your tables before you + start a run. 
See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4163"
+ >HBASE-4163 Create Split Strategy for YCSB Benchmark</link> for why and a little shell
+ command for how to do it.</para>
+ <para>Ted Dunning redid YCSB so that it is Mavenized and added a facility for verifying workloads. See
+ <link xlink:href="https://github.com/tdunning/YCSB">Ted Dunning's YCSB</link>.</para>
+
+
+</appendix>
