http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/architecture.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/architecture.xml b/src/main/docbkx/architecture.xml
new file mode 100644
index 0000000..16b298a
--- /dev/null
+++ b/src/main/docbkx/architecture.xml
@@ -0,0 +1,3489 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter
+    xml:id="architecture"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+    <title>Architecture</title>
+       <section xml:id="arch.overview">
+       <title>Overview</title>
+         <section xml:id="arch.overview.nosql">
+         <title>NoSQL?</title>
+         <para>HBase is a type of "NoSQL" database.  "NoSQL" is a general term meaning that the database isn't an RDBMS which
+         supports SQL as its primary access language, but there are many types of NoSQL databases:  BerkeleyDB is an
+         example of a local NoSQL database, whereas HBase is very much a distributed database.  Technically speaking,
+         HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS,
+         such as typed columns, secondary indexes, triggers, and advanced query languages.
+         </para>
+         <para>However, HBase has many features which support both linear and modular scaling.  HBase clusters expand
+         by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20
+         RegionServers, for example, it doubles in terms of both storage and processing capacity.
+         An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best
+         performance it requires specialized hardware and storage devices.  HBase features of note are:
+               <itemizedlist>
+              <listitem><para>Strongly consistent reads/writes:  HBase is not 
an "eventually consistent" DataStore.  This
+              makes it very suitable for tasks such as high-speed counter 
aggregation.</para>  </listitem>
+              <listitem><para>Automatic sharding:  HBase tables are 
distributed on the cluster via regions, and regions are
+              automatically split and re-distributed as your data 
grows.</para></listitem>
+              <listitem><para>Automatic RegionServer failover</para></listitem>
+              <listitem><para>Hadoop/HDFS Integration:  HBase supports HDFS 
out of the box as its distributed file system.</para></listitem>
+              <listitem><para>MapReduce:  HBase supports massively 
parallelized processing via MapReduce for using HBase as both
+              source and sink.</para></listitem>
+              <listitem><para>Java Client API:  HBase supports an easy to use 
Java API for programmatic access.</para></listitem>
+              <listitem><para>Thrift/REST API:  HBase also supports Thrift and 
REST for non-Java front-ends.</para></listitem>
+              <listitem><para>Block Cache and Bloom Filters:  HBase supports a 
Block Cache and Bloom Filters for high volume query 
optimization.</para></listitem>
+              <listitem><para>Operational Management:  HBase provides built-in web pages for operational insight as well as JMX metrics.</para></listitem>
+            </itemizedlist>
+         </para>
+      </section>
+
+         <section xml:id="arch.overview.when">
+           <title>When Should I Use HBase?</title>
+                 <para>HBase isn't suitable for every problem.</para>
+                 <para>First, make sure you have enough data.  If you have hundreds of millions or billions of rows, then
+                   HBase is a good candidate.  If you only have a few thousand/million rows, then using a traditional RDBMS
+                   might be a better choice because all of your data might wind up on a single node (or two) and
+                   the rest of the cluster may be sitting idle.
+                 </para>
+                 <para>Second, make sure you can live without all the extra 
features that an RDBMS provides (e.g., typed columns,
+                 secondary indexes, transactions, advanced query languages, 
etc.)  An application built against an RDBMS cannot be
+                 "ported" to HBase by simply changing a JDBC driver, for 
example.  Consider moving from an RDBMS to HBase as a
+                 complete redesign as opposed to a port.
+              </para>
+                 <para>Third, make sure you have enough hardware.  Even HDFS doesn't do well with
+                fewer than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
+                </para>
+                <para>HBase can run quite well stand-alone on a laptop - but 
this should be considered a development
+                configuration only.
+                </para>
+      </section>
+      <section xml:id="arch.overview.hbasehdfs">
+        <title>What Is The Difference Between HBase and Hadoop/HDFS?</title>
+          <para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files.
+          Its documentation states that it is not, however, a general purpose 
file system, and does not provide fast individual record lookups in files.
+          HBase, on the other hand, is built on top of HDFS and provides fast 
record lookups (and updates) for large tables.
+          This can sometimes be a point of conceptual confusion.  HBase 
internally puts your data in indexed "StoreFiles" that exist
+          on HDFS for high-speed lookups.  See the <xref linkend="datamodel" 
/> and the rest of this chapter for more information on how HBase achieves its 
goals.
+         </para>
+      </section>
+       </section>
+
+    <section
+      xml:id="arch.catalog">
+      <title>Catalog Tables</title>
+      <para>The catalog table <code>hbase:meta</code> exists as an HBase table 
and is filtered out of the HBase
+        shell's <code>list</code> command, but is in fact a table just like 
any other. </para>
+      <section
+        xml:id="arch.catalog.root">
+        <title>-ROOT-</title>
+        <note>
+          <para>The <code>-ROOT-</code> table was removed in HBase 0.96.0. 
Information here should
+            be considered historical.</para>
+        </note>
+        <para>The <code>-ROOT-</code> table kept track of the location of the
+            <code>.META.</code> table (the previous name for the table now called <code>hbase:meta</code>) prior to HBase
+          0.96. The <code>-ROOT-</code> table structure was as follows: </para>
+        <itemizedlist>
+          <title>Key</title>
+          <listitem>
+            <para>.META. region key (<code>.META.,,1</code>)</para>
+          </listitem>
+        </itemizedlist>
+
+        <itemizedlist>
+          <title>Values</title>
+          <listitem>
+            <para><code>info:regioninfo</code> (serialized <link
+                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link>
+              instance of hbase:meta)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:server</code> (server:port of the RegionServer 
holding
+              hbase:meta)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:serverstartcode</code> (start-time of the 
RegionServer process holding
+              hbase:meta)</para>
+          </listitem>
+        </itemizedlist>
+      </section>
+      <section
+        xml:id="arch.catalog.meta">
+        <title>hbase:meta</title>
+        <para>The <code>hbase:meta</code> table (previously called 
<code>.META.</code>) keeps a list
+          of all regions in the system. The location of 
<code>hbase:meta</code> was previously
+          tracked within the <code>-ROOT-</code> table, but is now stored in 
Zookeeper.</para>
+        <para>The <code>hbase:meta</code> table structure is as follows: 
</para>
+        <itemizedlist>
+          <title>Key</title>
+          <listitem>
+            <para>Region key of the format (<code>[table],[region start 
key],[region
+              id]</code>)</para>
+          </listitem>
+        </itemizedlist>
+        <itemizedlist>
+          <title>Values</title>
+          <listitem>
+            <para><code>info:regioninfo</code> (serialized <link
+                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link>
+              instance for this region)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:server</code> (server:port of the RegionServer 
containing this
+              region)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:serverstartcode</code> (start-time of the 
RegionServer process
+              containing this region)</para>
+          </listitem>
+        </itemizedlist>
+        <para>When a table is in the process of splitting, two other columns 
will be created, called
+            <code>info:splitA</code> and <code>info:splitB</code>. These 
columns represent the two
+          daughter regions. The values for these columns are also serialized 
HRegionInfo instances.
+          After the region has been split, eventually this row will be 
deleted. </para>
+        <note>
+          <title>Note on HRegionInfo</title>
+          <para>The empty key is used to denote table start and table end. A region with an empty
+            start key is the first region in a table. If a region has both an empty start and an
+            empty end key, it is the only region in the table. </para>
+        </note>
+        <para>In the (hopefully unlikely) event that programmatic processing 
of catalog metadata is
+          required, see the <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</link>
+          utility. </para>
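As an aside, the key layout described above can be mimicked with a short plain-Java sketch. This is purely illustrative: real <code>hbase:meta</code> row keys are raw bytes built by HBase internals, and <code>metaRowKey</code> below is a hypothetical helper, not HBase code.

```java
public class MetaKeySketch {
    // Hypothetical helper mimicking the [table],[region start key],[region id] layout.
    static String metaRowKey(String table, String startKey, long regionId) {
        return table + "," + startKey + "," + regionId;
    }

    public static void main(String[] args) {
        System.out.println(metaRowKey("myTable", "rowA", 1396447833509L));
        // A region with an empty start key sorts before any other region of the
        // same table, matching the "empty key denotes table start" convention.
        System.out.println(
            metaRowKey("myTable", "", 1L).compareTo(metaRowKey("myTable", "rowA", 1L)) < 0);
    }
}
```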
+      </section>
+      <section
+        xml:id="arch.catalog.startup">
+        <title>Startup Sequencing</title>
+        <para>First, the location of <code>hbase:meta</code> is looked up in 
Zookeeper. Next,
+          <code>hbase:meta</code> is updated with server and startcode 
values.</para>  
+        <para>For information on region-RegionServer assignment, see <xref
+            linkend="regions.arch.assignment" />. </para>
+      </section>
+    </section>  <!--  catalog -->
+
+    <section
+      xml:id="client">
+      <title>Client</title>
+      <para>The HBase client finds the RegionServers that are serving the 
particular row range of
+        interest. It does this by querying the <code>hbase:meta</code> table. 
See <xref
+          linkend="arch.catalog.meta" /> for details. After locating the 
required region(s), the
+        client contacts the RegionServer serving that region, rather than 
going through the master,
+        and issues the read or write request. This information is cached in 
the client so that
+        subsequent requests need not go through the lookup process. Should a 
region be reassigned
+        either by the master load balancer or because a RegionServer has died, 
the client will
+        requery the catalog tables to determine the new location of the user 
region. </para>
+
+      <para>See <xref
+          linkend="master.runtime" /> for more information about the impact of 
the Master on HBase
+        Client communication. </para>
+      <para>Administrative functions are done via an instance of <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html">Admin</link>.
+      </para>
+
+      <section
+        xml:id="client.connections">
+        <title>Cluster Connections</title>
+        <para>The API changed in HBase 1.0. It has been cleaned up and users are returned
+          interfaces to work against rather than particular types. In HBase 1.0,
+          obtain a cluster Connection from ConnectionFactory and thereafter, get from it
+          instances of Table, Admin, and RegionLocator on an as-needed basis. When done, close
+          obtained instances.  Finally, be sure to clean up your Connection instance before
+          exiting.  Connections are heavyweight objects. Create once and keep an instance around.
+          Table, Admin and RegionLocator instances are lightweight. Create as you go and then
+          let go as soon as you are done by closing them. See the
+          <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html">Client Package Javadoc Description</link> for example usage of the new HBase 1.0 API.</para>
+
+        <para>For connection configuration information, see <xref 
linkend="client_dependencies" />. </para>
+
+        <para><emphasis><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
+            instances are not thread-safe</emphasis>. Only one thread can use an instance of Table at
+          any given time. When creating Table instances, it is advisable to use the same <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration.html">HBaseConfiguration</link>
+          instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers,
+          which is usually what you want. For example, this is preferred:</para>
+          <programlisting language="java">Configuration conf = HBaseConfiguration.create();
+HTable table1 = new HTable(conf, "myTable");
+HTable table2 = new HTable(conf, "myTable");</programlisting>
+          <para>as opposed to this:</para>
+          <programlisting language="java">Configuration conf1 = HBaseConfiguration.create();
+HTable table1 = new HTable(conf1, "myTable");
+Configuration conf2 = HBaseConfiguration.create();
+HTable table2 = new HTable(conf2, "myTable");</programlisting>
+
+        <para>For more information about how connections are handled in the HBase client,
+        see <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html">HConnectionManager</link>.
+          </para>
+          <section xml:id="client.connection.pooling"><title>Connection 
Pooling</title>
+            <para>For applications which require high-end multithreaded access 
(e.g., web-servers or application servers that may serve many application 
threads
+            in a single JVM), you can pre-create an 
<classname>HConnection</classname>, as shown in
+              the following example:</para>
+            <example>
+              <title>Pre-Creating an <code>HConnection</code></title>
+              <programlisting language="java">// Create a connection to the cluster.
+Configuration conf = HBaseConfiguration.create();
+HConnection connection = HConnectionManager.createConnection(conf);
+HTableInterface table = connection.getTable("myTable");
+// use table as needed, the table returned is lightweight
+table.close();
+// use the connection for other access to the cluster
+connection.close();</programlisting>
+            </example>
+          <para>Constructing an HTableInterface implementation is very lightweight, and resources are
+            controlled.</para>
+            <warning>
+              <title><code>HTablePool</code> is Deprecated</title>
+              <para>Previous versions of this guide discussed <code>HTablePool</code>, which was
+                deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by <link
+                  xlink:href="https://issues.apache.org/jira/browse/HBASE-6580">HBASE-6580</link>.
+                Please use <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html"><code>HConnection</code></link> instead.</para>
+            </warning>
+          </section>
+         </section>
+          <section xml:id="client.writebuffer"><title>WriteBuffer and Batch 
Methods</title>
+           <para>If <xref linkend="perf.hbase.client.autoflush" /> is turned 
off on
+               <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
+               <classname>Put</classname>s are sent to RegionServers when the 
writebuffer
+               is filled.  The writebuffer is 2MB by default.  Before an 
HTable instance is
+               discarded, either <methodname>close()</methodname> or
+               <methodname>flushCommits()</methodname> should be invoked so 
Puts
+               will not be lost.
+             </para>
+             <para>Note: <code>htable.delete(Delete);</code> does not go in 
the writebuffer!  This only applies to Puts.
+             </para>
+             <para>For additional information on write durability, review the 
<link xlink:href="../acid-semantics.html">ACID semantics</link> page.
+             </para>
+       <para>For fine-grained control of batching of
+           <classname>Put</classname>s or <classname>Delete</classname>s,
+           see the <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29">batch</link> methods on HTable.
+          </para>
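The buffering behavior described above can be sketched in plain Java. This is a conceptual illustration only, not the real client implementation; in HBase the 2MB threshold is the default value of <code>hbase.client.write.buffer</code> and the flush is an RPC to the RegionServers.

```java
import java.util.ArrayList;
import java.util.List;

public class WriteBufferSketch {
    private final long flushThresholdBytes;
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    public int flushCount = 0; // exposed so the example below can observe flushes

    public WriteBufferSketch(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    // Analogous to buffering a Put: nothing is sent until the buffer fills.
    public void put(byte[] payload) {
        buffer.add(payload);
        bufferedBytes += payload.length;
        if (bufferedBytes >= flushThresholdBytes) {
            flushCommits();
        }
    }

    // Analogous to flushCommits()/close(): send and clear buffered edits so
    // Puts are not lost when the instance is discarded.
    public void flushCommits() {
        if (!buffer.isEmpty()) {
            flushCount++; // in the real client, an RPC to the RegionServer
            buffer.clear();
            bufferedBytes = 0;
        }
    }
}
```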
+          </section>
+          <section xml:id="client.external"><title>External Clients</title>
+           <para>Information on non-Java clients and custom protocols is covered in <xref linkend="external_apis" />.
+           </para>
+               </section>
+       </section>
+
+    <section xml:id="client.filter"><title>Client Request Filters</title>
+      <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link> and <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> instances can be
+       optionally configured with <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html">filters</link> which are applied on the RegionServer.
+      </para>
+      <para>Filters can be confusing because there are many different types, 
and it is best to approach them by understanding the groups
+      of Filter functionality.
+      </para>
+      <section xml:id="client.filter.structural"><title>Structural</title>
+        <para>Structural Filters contain other Filters.</para>
+        <section xml:id="client.filter.structural.fl"><title>FilterList</title>
+          <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html">FilterList</link>
+          represents a list of Filters with a relationship of <code>FilterList.Operator.MUST_PASS_ALL</code> or
+          <code>FilterList.Operator.MUST_PASS_ONE</code> between the Filters.  The following example shows an 'or' between two
+          Filters (checking for either 'my value' or 'my other value' on the same attribute).</para>
+<programlisting language="java">
+FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
+SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
+       cf,
+       column,
+       CompareOp.EQUAL,
+       Bytes.toBytes("my value")
+       );
+list.addFilter(filter1);
+SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
+       cf,
+       column,
+       CompareOp.EQUAL,
+       Bytes.toBytes("my other value")
+       );
+list.addFilter(filter2);
+scan.setFilter(list);
+</programlisting>
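The two operators behave as logical AND and OR across the wrapped filters. The following plain-Java sketch (using java.util.function.Predicate rather than the HBase Filter API) illustrates the semantics:

```java
import java.util.List;
import java.util.function.Predicate;

public class FilterListSemantics {
    // MUST_PASS_ALL: every filter must accept the value (logical AND).
    static <T> Predicate<T> mustPassAll(List<Predicate<T>> filters) {
        return v -> filters.stream().allMatch(f -> f.test(v));
    }

    // MUST_PASS_ONE: at least one filter must accept the value (logical OR).
    static <T> Predicate<T> mustPassOne(List<Predicate<T>> filters) {
        return v -> filters.stream().anyMatch(f -> f.test(v));
    }

    public static void main(String[] args) {
        Predicate<String> f1 = "my value"::equals;
        Predicate<String> f2 = "my other value"::equals;
        Predicate<String> either = mustPassOne(List.of(f1, f2));
        System.out.println(either.test("my value"));       // true
        System.out.println(either.test("something else")); // false
    }
}
```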
+        </section>
+      </section>
+      <section
+        xml:id="client.filter.cv">
+        <title>Column Value</title>
+        <section
+          xml:id="client.filter.cv.scvf">
+          <title>SingleColumnValueFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html">SingleColumnValueFilter</link>
+            can be used to test column values for equivalence (<code><link
+                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html">CompareOp.EQUAL</link>
+            </code>), inequality (<code>CompareOp.NOT_EQUAL</code>), or ranges (e.g.,
+              <code>CompareOp.GREATER</code>). The following is an example of testing equivalence of a
+            column to a String value "my value"...</para>
+          <programlisting language="java">
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+       cf,
+       column,
+       CompareOp.EQUAL,
+       Bytes.toBytes("my value")
+       );
+scan.setFilter(filter);
+</programlisting>
+        </section>
+      </section>
+      <section
+        xml:id="client.filter.cvp">
+        <title>Column Value Comparators</title>
+        <para>There are several Comparator classes in the Filter package that 
deserve special
+          mention. These Comparators are used in concert with other Filters, 
such as <xref
+            linkend="client.filter.cv.scvf" />. </para>
+        <section
+          xml:id="client.filter.cvp.rcs">
+          <title>RegexStringComparator</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html">RegexStringComparator</link>
+            supports regular expressions for value comparisons.</para>
+          <programlisting language="java">
+RegexStringComparator comp = new RegexStringComparator("my.");   // any value 
that starts with 'my'
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+       cf,
+       column,
+       CompareOp.EQUAL,
+       comp
+       );
+scan.setFilter(filter);
+</programlisting>
+          <para>See the Oracle JavaDoc for <link
+              xlink:href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">supported
+              RegEx patterns in Java</link>. </para>
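Before wiring a pattern into the comparator, its behavior can be previewed with plain java.util.regex, which uses the same pattern syntax (illustrative only):

```java
import java.util.regex.Pattern;

public class RegexPreview {
    public static void main(String[] args) {
        // "my." matches 'my' followed by any single character.
        Pattern p = Pattern.compile("my.");
        System.out.println(p.matcher("my value").find());  // true: "my " matches
        System.out.println(p.matcher("other").find());     // false
    }
}
```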
+        </section>
+        <section
+          xml:id="client.filter.cvp.SubStringComparator">
+          <title>SubstringComparator</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html">SubstringComparator</link>
+            can be used to determine if a given substring exists in a value. The comparison is
+            case-insensitive. </para>
+          <programlisting language="java">
+SubstringComparator comp = new SubstringComparator("y val");   // looking for 
'my value'
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+       cf,
+       column,
+       CompareOp.EQUAL,
+       comp
+       );
+scan.setFilter(filter);
+</programlisting>
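The case-insensitive substring check is equivalent to the following plain-Java test (a hypothetical equivalent for illustration, not the comparator's actual code):

```java
public class SubstringPreview {
    // Hypothetical equivalent of the comparator's case-insensitive match.
    static boolean containsIgnoreCase(String value, String substr) {
        return value.toLowerCase().contains(substr.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(containsIgnoreCase("my value", "y val")); // true
        System.out.println(containsIgnoreCase("my value", "Y VAL")); // true
        System.out.println(containsIgnoreCase("my value", "xyz"));   // false
    }
}
```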
+        </section>
+        <section
+          xml:id="client.filter.cvp.bfp">
+          <title>BinaryPrefixComparator</title>
+          <para>See <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.html">BinaryPrefixComparator</link>.</para>
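A binary prefix comparison inspects only the leading bytes of a value. A minimal plain-Java sketch of the idea (not the HBase implementation):

```java
import java.util.Arrays;

public class BinaryPrefixPreview {
    // True if value starts with the given byte prefix.
    static boolean hasPrefix(byte[] value, byte[] prefix) {
        return value.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(value, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] value = "abcdef".getBytes();
        System.out.println(hasPrefix(value, "abc".getBytes())); // true
        System.out.println(hasPrefix(value, "abd".getBytes())); // false
    }
}
```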
+        </section>
+        <section
+          xml:id="client.filter.cvp.bc">
+          <title>BinaryComparator</title>
+          <para>See <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComparator.html">BinaryComparator</link>.</para>
+        </section>
+      </section>
+      <section
+        xml:id="client.filter.kvm">
+        <title>KeyValue Metadata</title>
+        <para>As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate
+          the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to
+          values, which were the subject of the previous section. </para>
+        <section
+          xml:id="client.filter.kvm.ff">
+          <title>FamilyFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html">FamilyFilter</link>
+            can be used to filter on the ColumnFamily. It is generally a 
better idea to select
+            ColumnFamilies in the Scan than to do it with a Filter.</para>
+        </section>
+        <section
+          xml:id="client.filter.kvm.qf">
+          <title>QualifierFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html">QualifierFilter</link>
+            can be used to filter based on Column (aka Qualifier) name. </para>
+        </section>
+        <section
+          xml:id="client.filter.kvm.cpf">
+          <title>ColumnPrefixFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html">ColumnPrefixFilter</link>
+            can be used to filter based on the lead portion of Column (aka 
Qualifier) names. </para>
+          <para>A ColumnPrefixFilter seeks ahead to the first column matching 
the prefix in each row
+            and for each involved column family. It can be used to efficiently 
get a subset of the
+            columns in very wide rows. </para>
+          <para>Note: The same column qualifier can be used in different 
column families. This
+            filter returns all matching columns. </para>
+          <para>Example: Find all columns in a row and family that start with 
"abc"</para>
+          <programlisting language="java">
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[] prefix = Bytes.toBytes("abc");
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new ColumnPrefixFilter(prefix);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+</programlisting>
+        </section>
+        <section
+          xml:id="client.filter.kvm.mcpf">
+          <title>MultipleColumnPrefixFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html">MultipleColumnPrefixFilter</link>
+            behaves like ColumnPrefixFilter but allows specifying multiple 
prefixes. </para>
+          <para>Like ColumnPrefixFilter, MultipleColumnPrefixFilter 
efficiently seeks ahead to the
+            first column matching the lowest prefix and also seeks past ranges 
of columns between
+            prefixes. It can be used to efficiently get discontinuous sets of 
columns from very wide
+            rows. </para>
+          <para>Example: Find all columns in a row and family that start with 
"abc" or "xyz"</para>
+          <programlisting language="java">
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new MultipleColumnPrefixFilter(prefixes);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+</programlisting>
+        </section>
+        <section
+          xml:id="client.filter.kvm.crf">
+          <title>ColumnRangeFilter</title>
+          <para>A <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html">ColumnRangeFilter</link>
+            allows efficient intra-row scanning. </para>
+          <para>A ColumnRangeFilter can seek ahead to the first matching column for each involved
+            column family. It can be used to efficiently get a 'slice' of the columns of a very wide
+            row, e.g. when you have a million columns in a row but only want to look at columns
+            bbbb-bbdd. </para>
+          <para>Note: The same column qualifier can be used in different 
column families. This
+            filter returns all matching columns. </para>
+          <para>Example: Find all columns in a row and family between "bbbb" 
(inclusive) and "bbdd"
+            (inclusive)</para>
+          <programlisting language="java">
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[] startColumn = Bytes.toBytes("bbbb");
+byte[] endColumn = Bytes.toBytes("bbdd");
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+</programlisting>
+            <para>Note:  Introduced in HBase 0.92</para>
+        </section>
+      </section>
+      <section xml:id="client.filter.row"><title>RowKey</title>
+        <section xml:id="client.filter.row.rf"><title>RowFilter</title>
+          <para>It is generally a better idea to use the startRow/stopRow methods on Scan for row selection; however,
+          <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RowFilter.html">RowFilter</link> can also be used.</para>
+        </section>
+      </section>
+      <section xml:id="client.filter.utility"><title>Utility</title>
+        <section 
xml:id="client.filter.utility.fkof"><title>FirstKeyOnlyFilter</title>
+          <para>This is primarily used for rowcount jobs.
+          See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html">FirstKeyOnlyFilter</link>.</para>
+        </section>
+      </section>
+       </section>  <!--  client.filter -->
+
+    <section xml:id="master"><title>Master</title>
+      <para><code>HMaster</code> is the implementation of the Master Server. 
The Master server is
+        responsible for monitoring all RegionServer instances in the cluster, 
and is the interface
+        for all metadata changes. In a distributed cluster, the Master 
typically runs on the <xref
+          linkend="arch.hdfs.nn"/>. J Mohamed Zahoor goes into some more 
detail on the Master
+        Architecture in this blog posting, <link
+          xlink:href="http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/">HBase HMaster
+          Architecture</link>.</para>
+       <section xml:id="master.startup"><title>Startup Behavior</title>
+         <para>If run in a multi-Master environment, all Masters compete to 
run the cluster.  If the active
+         Master loses its lease in ZooKeeper (or the Master shuts down), then
+         the remaining Masters jostle to take over the Master role.
+         </para>
+       </section>
+      <section
+        xml:id="master.runtime">
+        <title>Runtime Impact</title>
+        <para>A common dist-list question involves what happens to an HBase 
cluster when the Master
+          goes down. Because the HBase client talks directly to the 
RegionServers, the cluster can
+          still function in a "steady state." Additionally, per <xref
+            linkend="arch.catalog" />, <code>hbase:meta</code> exists as an 
HBase table and is not
+          resident in the Master. However, the Master controls critical 
functions such as
+          RegionServer failover and completing region splits. So while the 
cluster can still run for
+          a short time without the Master, the Master should be restarted as 
soon as possible.
+        </para>
+      </section>
+       <section xml:id="master.api"><title>Interface</title>
+         <para>The methods exposed by <code>HMasterInterface</code> are 
primarily metadata-oriented methods:
+         <itemizedlist>
+            <listitem><para>Table (createTable, modifyTable, removeTable, 
enable, disable)
+            </para></listitem>
+            <listitem><para>ColumnFamily (addColumn, modifyColumn, 
removeColumn)
+            </para></listitem>
+            <listitem><para>Region (move, assign, unassign)
+            </para></listitem>
+         </itemizedlist>
+         For example, when the <code>HBaseAdmin</code> method 
<code>disableTable</code> is invoked, it is serviced by the Master server.
+         </para>
+       </section>
+       <section xml:id="master.processes"><title>Processes</title>
+         <para>The Master runs several background threads:
+         </para>
+         <section 
xml:id="master.processes.loadbalancer"><title>LoadBalancer</title>
+           <para>Periodically, and when there are no regions in transition,
+             a load balancer will run and move regions around to balance the 
cluster's load.
+             See <xref linkend="balancer_config" /> for configuring this 
property.</para>
+             <para>See <xref linkend="regions.arch.assignment"/> for more 
information on region assignment.
+             </para>
+         </section>
+         <section 
xml:id="master.processes.catalog"><title>CatalogJanitor</title>
+           <para>Periodically checks and cleans up the hbase:meta table.  See 
<xref linkend="arch.catalog.meta" /> for more information on META.</para>
+         </section>
+       </section>
+
+     </section>
+    <section
+      xml:id="regionserver.arch">
+      <title>RegionServer</title>
+      <para><code>HRegionServer</code> is the RegionServer implementation. It 
is responsible for
+        serving and managing regions. In a distributed cluster, a RegionServer 
runs on a <xref
+          linkend="arch.hdfs.dn" />. </para>
+      <section
+        xml:id="regionserver.arch.api">
+        <title>Interface</title>
+        <para>The methods exposed by <code>HRegionInterface</code>
contain both data-oriented
+          and region-maintenance methods: <itemizedlist>
+            <listitem>
+              <para>Data (get, put, delete, next, etc.)</para>
+            </listitem>
+            <listitem>
+              <para>Region (splitRegion, compactRegion, etc.)</para>
+            </listitem>
+          </itemizedlist> For example, when the <code>HBaseAdmin</code> method
+            <code>majorCompact</code> is invoked on a table, the client is 
actually iterating
+          through all regions for the specified table and requesting a major 
compaction directly to
+          each region. </para>
+      </section>
+      <section
+        xml:id="regionserver.arch.processes">
+        <title>Processes</title>
+        <para>The RegionServer runs a variety of background threads:</para>
+        <section
+          xml:id="regionserver.arch.processes.compactsplit">
+          <title>CompactSplitThread</title>
+          <para>Checks for splits and handles minor compactions.</para>
+        </section>
+        <section
+          xml:id="regionserver.arch.processes.majorcompact">
+          <title>MajorCompactionChecker</title>
+          <para>Checks for major compactions.</para>
+        </section>
+        <section
+          xml:id="regionserver.arch.processes.memstore">
+          <title>MemStoreFlusher</title>
+          <para>Periodically flushes in-memory writes in the MemStore to 
StoreFiles.</para>
+        </section>
+        <section
+          xml:id="regionserver.arch.processes.log">
+          <title>LogRoller</title>
+          <para>Periodically checks the RegionServer's WAL.</para>
+        </section>
+      </section>
+
+      <section
+        xml:id="coprocessors">
+        <title>Coprocessors</title>
+        <para>Coprocessors were added in 0.92. There is a thorough <link
+            
xlink:href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">Blog
Overview
+            of CoProcessors</link> posted. Documentation will eventually move 
to this reference
+          guide, but the blog is the most current information available at 
this time. </para>
+      </section>
+
+      <section
+        xml:id="block.cache">
+        <title>Block Cache</title>
+
+        <para>HBase provides two different BlockCache implementations: the 
default onheap
+          LruBlockCache and BucketCache, which is (usually) offheap. This 
section
+          discusses benefits and drawbacks of each implementation, how to 
choose the appropriate
+          option, and configuration options for each.</para>
+
+      <note><title>Block Cache Reporting: UI</title>
+      <para>See the RegionServer UI for detail on caching deploy.  Since 
HBase-0.98.4, the
+          Block Cache detail has been significantly extended showing 
configurations,
+          sizings, current usage, time-in-the-cache, and even detail on block 
counts and types.</para>
+  </note>
+
+        <section>
+
+          <title>Cache Choices</title>
+          <para><classname>LruBlockCache</classname> is the original 
implementation, and is
+              entirely within the Java heap. 
<classname>BucketCache</classname> is mainly
+              intended for keeping blockcache data offheap, although 
BucketCache can also
+              keep data onheap and serve from a file-backed cache.
+              <note><title>BucketCache is production ready as of 
hbase-0.98.6</title>
+                <para>To run with BucketCache, you need HBASE-11678. This was 
included in
+                  hbase-0.98.6.
+                </para>
+              </note>
+          </para>
+
+          <para>Fetching from BucketCache will always be slower than fetching
+              from the native onheap LruBlockCache. However,
latencies tend to be
+              less erratic across time, because there is less garbage 
collection when you use
+              BucketCache since it is managing BlockCache allocations, not the 
GC. If the
+              BucketCache is deployed in offheap mode, this memory is not 
managed by the
+              GC at all. This is why you'd use BucketCache: your latencies are
+              less erratic, and GC pauses and heap fragmentation are mitigated.
+              See Nick Dimiduk's <link
+              xlink:href="http://www.n10k.com/blog/blockcache-101/">BlockCache
+              101</link> for
+            comparisons running onheap vs. offheap tests. Also see
+            <link xlink:href="http://people.apache.org/~stack/bc/">Comparing
+            BlockCache Deploys</link>,
+            which finds that if your dataset fits inside your LruBlockCache
+            deploy, you should use it; otherwise,
+            if you are experiencing cache churn (or you want your cache to
+            exist beyond the vagaries of Java GC), use BucketCache.
+              </para>
+
+              <para>When you enable BucketCache, you are enabling a two tier 
caching
+              system, an L1 cache which is implemented by an instance of 
LruBlockCache and
+              an offheap L2 cache which is implemented by BucketCache.  
Management of these
+              two tiers and the policy that dictates how blocks move between 
them is done by
+              <classname>CombinedBlockCache</classname>. It keeps all DATA 
blocks in the L2
+              BucketCache and meta blocks -- INDEX and BLOOM blocks --
+              onheap in the L1 <classname>LruBlockCache</classname>.
+              See <xref linkend="offheap.blockcache" /> for more detail on 
going offheap.</para>
+        </section>
+
+        <section xml:id="cache.configurations">
+            <title>General Cache Configurations</title>
+          <para>Apart from the cache implementation itself, you can set some 
general configuration
+            options to control how the cache performs. See <link
+              
xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html"
+            />. After setting any of these options, restart or rolling restart 
your cluster for the
+            configuration to take effect. Check logs for errors or unexpected 
behavior.</para>
+          <para>See also <xref linkend="blockcache.prefetch"/>, which 
discusses a new option
+            introduced in <link 
xlink:href="https://issues.apache.org/jira/browse/HBASE-9857"
+              >HBASE-9857</link>.</para>
+      </section>
+
+        <section
+          xml:id="block.cache.design">
+          <title>LruBlockCache Design</title>
+          <para>The LruBlockCache is an LRU cache that contains three levels 
of block priority to
+            allow for scan-resistance and in-memory ColumnFamilies: </para>
+          <itemizedlist>
+            <listitem>
+              <para>Single access priority: The first time a block is loaded 
from HDFS it normally
+                has this priority and it will be part of the first group to be 
considered during
+                evictions. The advantage is that scanned blocks are more 
likely to get evicted than
+                blocks that are getting more usage.</para>
+            </listitem>
+            <listitem>
+              <para>Multi access priority: If a block in the previous priority
group is accessed
+                again, it upgrades to this priority. It is thus part of the 
second group considered
+                during evictions.</para>
+            </listitem>
+            <listitem xml:id="hbase.cache.inmemory">
+              <para>In-memory access priority: If the block's family was 
configured to be
+                "in-memory", it will be part of this priority disregarding the 
number of times it
+                was accessed. Catalog tables are configured like this. This 
group is the last one
+                considered during evictions.</para>
+            <para>To mark a column family as in-memory, call
+                <programlisting 
language="java">HColumnDescriptor.setInMemory(true);</programlisting> if 
creating a table from java,
+                or set <command>IN_MEMORY => true</command> when creating or 
altering a table in
+                the shell: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 'f', IN_MEMORY => 'true'}</programlisting></para>
+            </listitem>
+          </itemizedlist>
+          <para>For more information, see the <link
+              xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html">LruBlockCache
+              source</link>.
+          </para>
+        </section>
+        <section
+          xml:id="block.cache.usage">
+          <title>LruBlockCache Usage</title>
+          <para>Block caching is enabled by default for all user tables,
which means that any
+            read operation will load the LRU cache. This might be good for a 
large number of use
+            cases, but further tunings are usually required in order to 
achieve better performance.
+            An important concept is the <link
+              
xlink:href="http://en.wikipedia.org/wiki/Working_set_size">working set
size</link>, or
+            WSS, which is: "the amount of memory needed to compute the answer 
to a problem". For a
+            website, this would be the data that's needed to answer the 
queries over a short amount
+            of time. </para>
+          <para>The way to calculate how much memory is available in HBase for 
caching is: </para>
+          <programlisting>
+            number of region servers * heap size * hfile.block.cache.size * 
0.99
+        </programlisting>
+          <para>The default value for the block cache is 0.25 which represents 
25% of the available
+            heap. The last value (99%) is the default acceptable loading 
factor in the LRU cache
+            after which eviction is started. The reason it is included in this 
equation is that it
+            would be unrealistic to say that it is possible to use 100% of the 
available memory
+            since this would make the process block from the point where it
loads new blocks.
+            Here are some examples: </para>
+          <itemizedlist>
+            <listitem>
+              <para>One region server with the default heap size (1 GB) and 
the default block cache
+                size will have 253 MB of block cache available.</para>
+            </listitem>
+            <listitem>
+              <para>20 region servers with the heap size set to 8 GB and a 
default block cache size
+                will have 39.6 GB of block cache.</para>
+            </listitem>
+            <listitem>
+              <para>100 region servers with the heap size set to 24 GB and a 
block cache size of 0.5
+                will have about 1.16 TB of block cache.</para>
+            </listitem>
+        </itemizedlist>
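The arithmetic in the examples above can be sketched in a few lines of plain Java (the class and method names here are illustrative only; the 0.99 acceptable-loading factor and the example heap sizes come from the text):

```java
// Illustrative sketch: the block cache estimate from the text,
// number-of-servers * heap * hfile.block.cache.size * 0.99.
public class BlockCacheSizing {

    // cacheFraction corresponds to the hfile.block.cache.size setting;
    // 0.99 is the default acceptable loading factor of the LRU cache.
    static double blockCacheGb(int servers, double heapGb, double cacheFraction) {
        return servers * heapGb * cacheFraction * 0.99;
    }

    public static void main(String[] args) {
        // 1 RS, 1 GB heap, default 0.25 fraction -> ~253 MB
        System.out.printf("%.1f MB%n", blockCacheGb(1, 1.0, 0.25) * 1024);
        // 20 RS, 8 GB heaps, default fraction -> 39.6 GB
        System.out.printf("%.1f GB%n", blockCacheGb(20, 8.0, 0.25));
        // 100 RS, 24 GB heaps, fraction 0.5 -> ~1.16 TB
        System.out.printf("%.2f TB%n", blockCacheGb(100, 24.0, 0.5) / 1024);
    }
}
```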
+        <para>Your data is not the only resident of the block cache. Here are 
others that you may have to take into account:
+        </para>
+          <variablelist>
+            <varlistentry>
+              <term>Catalog Tables</term>
+              <listitem>
+                <para>The <code>-ROOT-</code> (prior to HBase 0.96. See <xref
+                    linkend="arch.catalog.root" />) and 
<code>hbase:meta</code> tables are forced
+                  into the block cache and have the in-memory priority which 
means that they are
+                  harder to evict. The former never uses more than a few 
hundred bytes while the
+                  latter can occupy a few MBs (depending on the number of 
regions).</para>
+              </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>HFiles Indexes</term>
+              <listitem>
+                <para>An <firstterm>hfile</firstterm> is the file format that 
HBase uses to store
+                  data in HDFS. It contains a multi-layered index which allows 
HBase to seek to the
+                  data without having to read the whole file. The size of 
those indexes is a factor
+                  of the block size (64KB by default), the size of your keys 
and the amount of data
+                  you are storing. For big data sets it's not unusual to see 
numbers around 1GB per
+                  region server, although not all of it will be in cache 
because the LRU will evict
+                  indexes that aren't used.</para>
+              </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>Keys</term>
+              <listitem>
+                <para>The values that are stored are only half the picture, 
since each value is
+                  stored along with its keys (row key, family qualifier, and 
timestamp). See <xref
+                    linkend="keysize" />.</para>
+              </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>Bloom Filters</term>
+              <listitem>
+                <para>Just like the HFile indexes, those data structures (when 
enabled) are stored
+                  in the LRU.</para>
+              </listitem>
+            </varlistentry>
+          </variablelist>
+          <para>Currently the recommended way to measure HFile indexes and 
bloom filter sizes is to
+            look at the region server web UI and check out the relevant
+            metrics. For keys, sampling
+            can be done by using the HFile command line tool and looking for the
average key size
+            metric. Since HBase 0.98.3, you can view detail on BlockCache 
stats and metrics
+            in a special Block Cache section in the UI.</para>
+          <para>It's generally bad to use block caching when the WSS doesn't 
fit in memory. This is
+            the case when you have for example 40GB available across all your 
region servers' block
+            caches but you need to process 1TB of data. One of the reasons is 
that the churn
+            generated by the evictions will trigger more garbage collections 
unnecessarily. Here are
+            two use cases: </para>
+        <itemizedlist>
+            <listitem>
+              <para>Fully random reading pattern: This is a case where you 
almost never access the
+                same row twice within a short amount of time such that the 
chance of hitting a
+                cached block is close to 0. Setting block caching on such a 
table is a waste of
+                memory and CPU cycles, more so because it will generate more
+                garbage for the
+                JVM. For more information on monitoring GC, see <xref
+                  linkend="trouble.log.gc" />.</para>
+            </listitem>
+            <listitem>
+              <para>Mapping a table: In a typical MapReduce job that takes a 
table as input, every
+                row will be read only once so there's no need to put them into 
the block cache. The
+                Scan object has the option of turning this off via the 
setCaching method (set it to
+                false). You can still keep block caching turned on on this 
table if you need fast
+                random read access. An example would be counting the number of 
rows in a table that
+                serves live traffic, caching every block of that table would 
create massive churn
+                and would surely evict data that's currently in use. </para>
+            </listitem>
+          </itemizedlist>
+          <section xml:id="data.blocks.in.fscache">
+            <title>Caching META blocks only (DATA blocks in fscache)</title>
+            <para>An interesting setup is one where we cache META blocks only 
and we read DATA
+              blocks in on each access. If the DATA blocks fit inside fscache, 
this alternative
+              may make sense when access is completely random across a very 
large dataset.
+              To enable this setup, alter your table and for each column family
+              set <varname>BLOCKCACHE => 'false'</varname>.  You are 
'disabling' the
+              BlockCache for this column family only; you can never disable the
caching of
+              META blocks. Since
+              <link 
xlink:href="https://issues.apache.org/jira/browse/HBASE-4683">HBASE-4683 Always
cache index and bloom blocks</link>,
+              we will cache META blocks even if the BlockCache is disabled.
+            </para>
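Assuming a table 't' with a column family 'f' (names here are illustrative), the alteration described above can be applied from the HBase shell like so:

```
hbase(main):001:0> alter 't', {NAME => 'f', BLOCKCACHE => 'false'}
```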
+          </section>
+        </section>
+        <section
+          xml:id="offheap.blockcache">
+          <title>Offheap Block Cache</title>
+          <section xml:id="enable.bucketcache">
+            <title>How to Enable BucketCache</title>
+                <para>The usual deploy of BucketCache is via a managing class 
that sets up two caching tiers: an L1 onheap cache
+                    implemented by LruBlockCache and a second L2 cache 
implemented with BucketCache. The managing class is <link
+                        
xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html">CombinedBlockCache</link>
 by default.
+            The just-previous link describes the caching 'policy' implemented 
by CombinedBlockCache. In short, it works
+            by keeping meta blocks -- INDEX and BLOOM in the L1, onheap 
LruBlockCache tier -- and DATA
+            blocks are kept in the L2, BucketCache tier. It is possible to 
amend this behavior in
+            HBase since version 1.0 and ask that a column family have both its 
meta and DATA blocks hosted onheap in the L1 tier by
+            setting <varname>cacheDataInL1</varname> via
+                  <code>HColumnDescriptor.setCacheDataInL1(true)</code>
+            or in the shell, creating or amending column families setting 
<varname>CACHE_DATA_IN_L1</varname>
+            to true: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 'f', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}</programlisting></para>
+
+        <para>The BucketCache Block Cache can be deployed onheap, offheap, or 
file based.
+            You set which via the
+            <varname>hbase.bucketcache.ioengine</varname> setting.  Setting it 
to
+            <varname>heap</varname> will have BucketCache deployed inside the 
+            allocated java heap. Setting it to <varname>offheap</varname> will 
have
+            BucketCache make its allocations offheap,
+            and an ioengine setting of <varname>file:PATH_TO_FILE</varname> 
will direct
+            BucketCache to use file caching (useful in particular if you
have some fast i/o attached to the box such
+            as SSDs).
+        </para>
+        <para xml:id="raw.l1.l2">It is possible to deploy an L1+L2 setup where 
we bypass the CombinedBlockCache
+            policy and have BucketCache working as a strict L2 cache to the L1
+              LruBlockCache. For such a setup, set 
<varname>CacheConfig.BUCKET_CACHE_COMBINED_KEY</varname> to
+              <literal>false</literal>. In this mode, on eviction from L1, 
blocks go to L2.
+              When a block is cached, it is cached first in L1. When we go to 
look for a cached block,
+              we look first in L1 and if none found, then search L2.  Let us 
call this deploy format,
+              <emphasis><indexterm><primary>Raw 
L1+L2</primary></indexterm></emphasis>.</para>
+          <para>Other BucketCache configs include: specifying a location to 
persist cache to across
+              restarts, how many threads to use writing the cache, etc.  See 
the
+              <link 
xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html">CacheConfig</link>
+              class for configuration options and descriptions.</para>
+
+            <procedure>
+              <title>BucketCache Example Configuration</title>
+              <para>This sample provides a configuration for a 4 GB offheap 
BucketCache with a 1 GB
+                  onheap cache. Configuration is performed on the 
RegionServer.  Setting
+                  <varname>hbase.bucketcache.ioengine</varname> and 
+                  <varname>hbase.bucketcache.size</varname> &gt; 0 enables 
CombinedBlockCache.
+                  Let us presume that the RegionServer has been set to run 
with a 5G heap:
+                  i.e. HBASE_HEAPSIZE=5g.
+              </para>
+              <step>
+                <para>First, edit the RegionServer's 
<filename>hbase-env.sh</filename> and set
+                  <varname>HBASE_OFFHEAPSIZE</varname> to a value greater than 
the offheap size wanted, in
+                  this case, 4 GB (expressed as 4G).  Let's set it to 5G.
That'll be 4G
+                  for our offheap cache and 1G for any other uses of offheap 
memory (there are
+                  other users of offheap memory other than BlockCache; e.g. 
DFSClient 
+                  in RegionServer can make use of offheap memory). See <xref 
linkend="direct.memory" />.</para>
+                <programlisting>HBASE_OFFHEAPSIZE=5G</programlisting>
+              </step>
+              <step>
+                <para>Next, add the following configuration to the 
RegionServer's
+                    <filename>hbase-site.xml</filename>.</para>
+                <programlisting language="xml">
+<![CDATA[<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>offheap</value>
+</property>
+<property>
+  <name>hfile.block.cache.size</name>
+  <value>0.2</value>
+</property>
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>4096</value>
+</property>]]>
+          </programlisting>
+              </step>
+              <step>
+                <para>Restart or rolling restart your cluster, and check the 
logs for any
+                  issues.</para>
+              </step>
+            </procedure>
+            <para>In the above, we set the bucketcache to be 4G.  We configured
+                the onheap LruBlockCache to have
+                0.2 of the RegionServer's heap size (0.2 *
+                5G = 1G).
+                In other words, you configure the L1 LruBlockCache as you 
would normally,
+                as you would when there is no L2 BucketCache present.
+            </para>
+            <para><link 
xlink:href="https://issues.apache.org/jira/browse/HBASE-10641";
+                >HBASE-10641</link> introduced the ability to configure 
multiple sizes for the
+              buckets of the bucketcache, in HBase 0.98 and newer. To 
configure multiple bucket
+              sizes, configure the new property 
<option>hfile.block.cache.sizes</option> (instead of
+                <option>hfile.block.cache.size</option>) to a comma-separated 
list of block sizes,
+              ordered from smallest to largest, with no spaces. The goal is to 
optimize the bucket
+              sizes based on your data access patterns. The following example 
configures buckets of
+              size 4096 and 8192.</para>
+            <screen language="xml"><![CDATA[
+<property>
+  <name>hfile.block.cache.sizes</name>
+  <value>4096,8192</value>
+</property>
+              ]]></screen>
+            <note xml:id="direct.memory">
+                <title>Direct Memory Usage In HBase</title>
+                <para>The default maximum direct memory varies by JVM.  
Traditionally it is 64M
+                    or some relation to allocated heap size (-Xmx) or no limit 
at all (JDK7 apparently).
+                    HBase servers use direct memory, in particular for
+                    short-circuit reading: the hosted DFSClient will
+                    allocate direct memory buffers.  If you do offheap block
+                    caching, you'll
+                    be making use of direct memory.  When starting your JVM, make
sure
+                    the <varname>-XX:MaxDirectMemorySize</varname> setting in
+                    <filename>conf/hbase-env.sh</filename> is set to some 
value that is
+                    higher than what you have allocated to your offheap 
blockcache
+                    (<varname>hbase.bucketcache.size</varname>).  It should be 
larger than your offheap block
+                    cache, and then some for DFSClient usage (how much the
DFSClient uses is not
+                    easy to quantify; it is the number of open hfiles * 
<varname>hbase.dfs.client.read.shortcircuit.buffer.size</varname>
+                    where hbase.dfs.client.read.shortcircuit.buffer.size is 
set to 128k in HBase -- see <filename>hbase-default.xml</filename>
+                    default configurations).
+                        Direct memory, which is part of the Java process's memory,
is separate from the object
+                        heap allocated by -Xmx. The value allocated by 
MaxDirectMemorySize must not exceed
+                        physical RAM, and is likely to be less than the total 
available RAM due to other
+                        memory requirements and system constraints.
+                </para>
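For example, a <filename>conf/hbase-env.sh</filename> fragment for the 4G-offheap deploy sketched earlier might look as follows (the 5g value is an illustrative assumption: offheap blockcache plus headroom for DFSClient direct buffers):

```
# conf/hbase-env.sh -- illustrative values only:
# MaxDirectMemorySize must exceed hbase.bucketcache.size, with headroom
# left over for DFSClient short-circuit read buffers.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=5g"
```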
+              <para>You can see how much memory -- onheap and offheap/direct 
-- a RegionServer is
+                configured to use and how much it is using at any one time by 
looking at the
+                  <emphasis>Server Metrics: Memory</emphasis> tab in the UI. 
It can also be gotten
+                via JMX. In particular the direct memory currently used by the 
server can be found
+                on the <varname>java.nio.type=BufferPool,name=direct</varname> 
bean. Terracotta has
+                a <link
+                  
xlink:href="http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options"
+                  >good write up</link> on using offheap memory in Java. It is
+                for their product
+                BigMemory, but a lot of the issues noted apply in general to any
+                attempt at going
+                offheap. Check it out.</para>
+            </note>
+              <note 
xml:id="hbase.bucketcache.percentage.in.combinedcache"><title>hbase.bucketcache.percentage.in.combinedcache</title>
+                  <para>This is a pre-HBase 1.0 configuration removed because 
it
+                      was confusing. It was a float that you would set to some 
value
+                      between 0.0 and 1.0.  Its default was 0.9. If the deploy 
was using
+                      CombinedBlockCache, then the LruBlockCache L1 size was 
calculated to
+                      be (1 - 
<varname>hbase.bucketcache.percentage.in.combinedcache</varname>) * 
<varname>size-of-bucketcache</varname> 
+                      and the BucketCache size was 
<varname>hbase.bucketcache.percentage.in.combinedcache</varname> * 
size-of-bucket-cache,
+                      where size-of-bucket-cache itself is EITHER the value of
+                      the configuration hbase.bucketcache.size
+                      IF it was specified as megabytes OR
+                      <varname>hbase.bucketcache.size</varname> *
+                      <varname>-XX:MaxDirectMemorySize</varname> if
+                      <varname>hbase.bucketcache.size</varname> is between 0 and
+                      1.0.
+                  </para>
+                  <para>In 1.0, it should be more straightforward. L1
LruBlockCache size
+                      is set as a fraction of java heap using 
hfile.block.cache.size setting
+                      (not the best name) and L2 is set as above either in 
absolute
+                      megabytes or as a fraction of allocated maximum direct 
memory.
+                  </para>
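The old pre-1.0 split can be sketched in plain Java (method names are illustrative only; assume <varname>hbase.bucketcache.size</varname> has already been resolved to megabytes):

```java
// Illustrative sketch of the pre-1.0 CombinedBlockCache sizing described above.
public class CombinedCacheSplit {

    // share given to the L1 LruBlockCache
    static double l1Mb(double bucketCacheMb, double pctInCombined) {
        return (1.0 - pctInCombined) * bucketCacheMb;
    }

    // share given to the L2 BucketCache
    static double l2Mb(double bucketCacheMb, double pctInCombined) {
        return pctInCombined * bucketCacheMb;
    }

    public static void main(String[] args) {
        // default hbase.bucketcache.percentage.in.combinedcache was 0.9;
        // with a 4096 MB bucket cache the L1 gets ~10%, the L2 ~90%.
        System.out.println("L1 MB: " + l1Mb(4096, 0.9));
        System.out.println("L2 MB: " + l2Mb(4096, 0.9));
    }
}
```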
+              </note>
+          </section>
+        </section>
+        <section>
+          <title>Compressed BlockCache</title>
+          <para><link
+            xlink:href="https://issues.apache.org/jira/browse/HBASE-11331"
+              >HBASE-11331</link> introduced lazy blockcache decompression, more simply referred to
+            as compressed blockcache. When compressed blockcache is enabled, data and encoded data
+            blocks are cached in the blockcache in their on-disk format, rather than being
+            decompressed and decrypted before caching.</para>
+          <para>For a RegionServer
+            hosting more data than can fit into cache, enabling this feature
+            with SNAPPY compression
+            has been shown to result in a 50% increase in throughput and a 30%
+            improvement in mean
+            latency, while increasing garbage collection by 80% and increasing
+            overall CPU load by
+            2%. See HBASE-11331 for more details about how performance was 
measured and achieved.
+            For a RegionServer hosting data that can comfortably fit into 
cache, or if your workload
+            is sensitive to extra CPU or garbage-collection load, you may 
receive less
+            benefit.</para>
+          <para>Compressed blockcache is disabled by default. To enable it, set
+              <code>hbase.block.data.cachecompressed</code> to 
<code>true</code> in
+              <filename>hbase-site.xml</filename> on all RegionServers.</para>
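For example, a minimal <filename>hbase-site.xml</filename> fragment (the rest of the configuration is elided):

```xml
<!-- hbase-site.xml: enable compressed blockcache on each RegionServer -->
<property>
  <name>hbase.block.data.cachecompressed</name>
  <value>true</value>
</property>
```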
+        </section>
+      </section>
+
+      <section
+        xml:id="wal">
+        <title>Write Ahead Log (WAL)</title>
+
+        <section
+          xml:id="purpose.wal">
+          <title>Purpose</title>
+          <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all 
changes to data in
+            HBase, to file-based storage. Under normal operations, the WAL is 
not needed because
+            data changes move from the MemStore to StoreFiles. However, if a 
RegionServer crashes or
+          becomes unavailable before the MemStore is flushed, the WAL ensures 
that the changes to
+          the data can be replayed. If writing to the WAL fails, the entire 
operation to modify the
+          data fails.</para>
+          <para>
+            HBase uses an implementation of the <link xlink:href=
+            
"http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html"
+            >WAL</link> interface. Usually, there is only one instance of a 
WAL per RegionServer.
+            The RegionServer records Puts and Deletes to it, before recording 
them to the <xref
+              linkend="store.memstore" /> for the affected <xref
+              linkend="store" />.
+          </para>
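The ordering described above (record to the WAL first, then apply to the MemStore, and fail the whole mutation if the WAL write fails) can be sketched as a toy model. This is illustrative only; the names are hypothetical, not actual RegionServer code:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the WAL-first write path (illustrative, not HBase code). */
public class WalWritePathModel {
    static final List<String> wal = new ArrayList<>();
    static final List<String> memstore = new ArrayList<>();
    static boolean walHealthy = true; // simulates whether WAL appends succeed

    /**
     * A mutation is recorded in the WAL before it touches the MemStore.
     * If the WAL append fails, the entire operation fails and the
     * MemStore is left untouched.
     */
    static boolean put(String edit) {
        if (!walHealthy) {
            return false;           // WAL write failed: fail the whole mutation
        }
        wal.add(edit);              // 1. append to the WAL first
        memstore.add(edit);         // 2. then apply to the MemStore
        return true;
    }

    public static void main(String[] args) {
        System.out.println(put("row1/cf:q=v1")); // true
        walHealthy = false;
        System.out.println(put("row2/cf:q=v2")); // false
        System.out.println(memstore.size());     // 1: the failed put was never applied
    }
}
```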
+          <note>
+            <title>The HLog</title>
+            <para>
+              Prior to 2.0, the interface for WALs in HBase was named 
<classname>HLog</classname>.
+              In 0.94, HLog was the name of the implementation of the WAL. You 
will likely find
+              references to the HLog in documentation tailored to these older 
versions.
+            </para>
+          </note>
+          <para>The WAL resides in HDFS in the <filename>/hbase/WALs/</filename> directory (prior to
+            HBase 0.94, they were stored in <filename>/hbase/.logs/</filename>), with one
+            subdirectory per RegionServer.</para>
+          <para> For more general information about the concept of write ahead 
logs, see the
+            Wikipedia <link
+              xlink:href="http://en.wikipedia.org/wiki/Write-ahead_logging">Write-Ahead Log</link>
+            article. </para>
+        </section>
+        <section
+          xml:id="wal_flush">
+          <title>WAL Flushing</title>
+          <para>TODO (describe). </para>
+        </section>
+
+        <section
+          xml:id="wal_splitting">
+          <title>WAL Splitting</title>
+
+          <para>A RegionServer serves many regions. All of the regions in a 
region server share the
+            same active WAL file. Each edit in the WAL file includes 
information about which region
+            it belongs to. When a region is opened, the edits in the WAL file 
which belong to that
+            region need to be replayed. Therefore, edits in the WAL file must 
be grouped by region
+            so that particular sets can be replayed to regenerate the data in 
a particular region.
+            The process of grouping the WAL edits by region is called 
<firstterm>log
+              splitting</firstterm>. It is a critical process for recovering 
data if a region server
+            fails.</para>
+          <para>Log splitting is done by the HMaster during cluster start-up or by the
+            ServerShutdownHandler as a region server shuts down. To guarantee consistency,
+            affected regions are unavailable until data is restored: all WAL edits must be
+            recovered and replayed before a given region can become available again. As a result,
+            regions affected by log splitting are unavailable until the process completes.</para>
+          <procedure xml:id="log.splitting.step.by.step">
+            <title>Log Splitting, Step by Step</title>
+            <step>
+              <title>The 
<filename>/hbase/WALs/&lt;host>,&lt;port>,&lt;startcode></filename> directory 
is renamed.</title>
+              <para>Renaming the directory is important because a RegionServer 
may still be up and
+                accepting requests even if the HMaster thinks it is down. If 
the RegionServer does
+                not respond immediately and does not heartbeat its ZooKeeper 
session, the HMaster
+                may interpret this as a RegionServer failure. Renaming the 
logs directory ensures
+                that existing, valid WAL files which are still in use by an 
active but busy
+                RegionServer are not written to by accident.</para>
+              <para>The new directory is named according to the following 
pattern:</para>
+              
<screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
+              <para>An example of such a renamed directory might look like the 
following:</para>
+              
<screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
+            </step>
+            <step>
+              <title>Each log file is split, one at a time.</title>
+              <para>The log splitter reads the log file one edit entry at a 
time and puts each edit
+                entry into the buffer corresponding to the edit’s region. At 
the same time, the
+                splitter starts several writer threads. Writer threads pick up 
a corresponding
+                buffer and write the edit entries in the buffer to a temporary 
recovered edit
+                file. The temporary edit file is stored to disk with the 
following naming pattern:</para>
+              
<screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
+              <para>This file is used to store all the edits in the WAL log 
for this region. After
+                log splitting completes, the <filename>.temp</filename> file 
is renamed to the
+                sequence ID of the first log written to the file.</para>
+              <para>To determine whether all edits have been written, the 
sequence ID is compared to
+                the sequence of the last edit that was written to the HFile. 
If the sequence of the
+                last edit is greater than or equal to the sequence ID included 
in the file name, it
+                is clear that all writes from the edit file have been 
completed.</para>
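The comparison above can be sketched as a simple predicate (a simplified illustration; the class and method names are hypothetical, not HBase's actual implementation):

```java
public class RecoveredEditsCheck {
    /**
     * A recovered.edits file is renamed to the sequence ID of the first edit
     * it contains. If the last edit written to an HFile has a sequence ID at
     * or beyond that, every edit in the file is already persisted and the
     * file does not need to be replayed.
     */
    static boolean allEditsPersisted(long fileNameSeqId, long lastFlushedSeqId) {
        return lastFlushedSeqId >= fileNameSeqId;
    }

    public static void main(String[] args) {
        // Region already flushed through seq 150; file starts at seq 100.
        System.out.println(allEditsPersisted(100L, 150L)); // true: replay not needed
        // File starts at seq 200, beyond the last flush: replay is needed.
        System.out.println(allEditsPersisted(200L, 150L)); // false
    }
}
```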
+            </step>
+            <step>
+              <title>After log splitting is complete, each affected region is 
assigned to a
+                RegionServer.</title>
+              <para> When the region is opened, the 
<filename>recovered.edits</filename> folder is checked for recovered
+                edits files. If any such files are present, they are replayed 
by reading the edits
+                and saving them to the MemStore. After all edit files are 
replayed, the contents of
+                the MemStore are written to disk (HFile) and the edit files 
are deleted.</para>
+            </step>
+          </procedure>
+  
+          <section>
+            <title>Handling of Errors During Log Splitting</title>
+
+            <para>If you set the 
<varname>hbase.hlog.split.skip.errors</varname> option to
+                <constant>true</constant>, errors are treated as 
follows:</para>
+            <itemizedlist>
+              <listitem>
+                <para>Any error encountered during splitting will be 
logged.</para>
+              </listitem>
+              <listitem>
+                <para>The problematic WAL log will be moved into the <filename>.corrupt</filename>
+                  directory under the hbase <varname>rootdir</varname>.</para>
+              </listitem>
+              <listitem>
+                <para>Processing of the WAL will continue.</para>
+              </listitem>
+            </itemizedlist>
+            <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
+                <literal>false</literal>, the default, the exception will be propagated and the
+              split will be logged as failed. See <link
+                    xlink:href="https://issues.apache.org/jira/browse/HBASE-2958">HBASE-2958 When
+                    hbase.hlog.split.skip.errors is set to false, we fail the split but that's
+                    it</link>. We need to do more than just fail the split if this flag is set.</para>
+            
+            <section>
+              <title>How EOFExceptions are treated when splitting a crashed RegionServer's
+                WALs</title>
+
+              <para>If an EOFException occurs while splitting logs, the split 
proceeds even when
+                  <varname>hbase.hlog.split.skip.errors</varname> is set to
+                <literal>false</literal>. An EOFException while reading the 
last log in the set of
+                files to split is likely, because the RegionServer is likely 
to be in the process of
+                writing a record at the time of a crash. For background, see <link
+                      xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
+                      Figure how to deal with eof splitting logs</link>.</para>
+            </section>
+          </section>
+          
+          <section>
+            <title>Performance Improvements during Log Splitting</title>
+            <para>
+              WAL log splitting and recovery can be resource intensive and 
take a long time,
+              depending on the number of RegionServers involved in the crash 
and the size of the
+              regions. <xref linkend="distributed.log.splitting" /> and <xref
+                linkend="distributed.log.replay" /> were developed to improve
+              performance during log splitting.
+            </para>
+            <section xml:id="distributed.log.splitting">
+              <title>Distributed Log Splitting</title>
+              <para><firstterm>Distributed Log Splitting</firstterm> was added 
in HBase version 0.92
+                (<link xlink:href="https://issues.apache.org/jira/browse/HBASE-1364">HBASE-1364</link>)
+                by Prakash Khemani from Facebook. It reduces the time to 
complete log splitting
+                dramatically, improving the availability of regions and 
tables. For
+                example, recovering a crashed cluster took around 9 hours with 
single-threaded log
+                splitting, but only about six minutes with distributed log 
splitting.</para>
+              <para>The information in this section is sourced from Jimmy 
Xiang's blog post at <link
+              xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/" />.</para>
+              
+              <formalpara>
+                <title>Enabling or Disabling Distributed Log Splitting</title>
+                <para>Distributed log processing is enabled by default since 
HBase 0.92. The setting
+                  is controlled by the 
<property>hbase.master.distributed.log.splitting</property>
+                  property, which can be set to <literal>true</literal> or 
<literal>false</literal>,
+                  but defaults to <literal>true</literal>. </para>
+              </formalpara>
+              <procedure>
+                <title>Distributed Log Splitting, Step by Step</title>
+                <para>After configuring distributed log splitting, the HMaster 
controls the process.
+                  The HMaster enrolls each RegionServer in the log splitting 
process, and the actual
+                  work of splitting the logs is done by the RegionServers. The 
general process for
+                  log splitting, as described in <xref
+                    linkend="log.splitting.step.by.step" />, still applies here.</para>
+                <step>
+                  <para>If distributed log processing is enabled, the HMaster 
creates a
+                    <firstterm>split log manager</firstterm> instance when the 
cluster is started.
+                    The split log manager manages all log files which need
+                    to be scanned and split. The split log manager places all 
the logs into the
+                    ZooKeeper splitlog node 
(<filename>/hbase/splitlog</filename>) as tasks. You can
+                  view the contents of the splitlog by issuing the following
+                    <command>zkcli</command> command. Example output is 
shown.</para>
+                  <screen language="bourne">ls /hbase/splitlog
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]</screen>
+                  <para>The output contains some non-ASCII characters. When decoded, it looks much
+                    simpler:</para>
+                  <screen>
+[hdfs://host2.sample.com:56020/hbase/.logs
+/host8.sample.com,57020,1340474893275-splitting
+/host8.sample.com%3A57020.1340474893900, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host3.sample.com,57020,1340474893299-splitting
+/host3.sample.com%3A57020.1340474893931, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host4.sample.com,57020,1340474893287-splitting
+/host4.sample.com%3A57020.1340474893946]                    
+                  </screen>
+                  <para>The listing represents WAL file names to be scanned 
and split, which is a
+                    list of log splitting tasks.</para>
+                </step>
+                <step>
+                  <title>The split log manager monitors the log-splitting 
tasks and workers.</title>
+                  <para>The split log manager is responsible for the following 
ongoing tasks:</para>
+                  <itemizedlist>
+                    <listitem>
+                      <para>Once the split log manager publishes all the tasks 
to the splitlog
+                        znode, it monitors these task nodes and waits for them 
to be
+                        processed.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any dead split log
+                        workers queued up. If it finds tasks claimed by 
unresponsive workers, it
+                        will resubmit those tasks. If the resubmit fails due 
to some ZooKeeper
+                        exception, the dead worker is queued up again for 
retry.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any unassigned
+                        tasks. If it finds any, it creates an ephemeral rescan node so that each
+                        split log worker is notified to re-scan unassigned 
tasks via the
+                          <code>nodeChildrenChanged</code> ZooKeeper 
event.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks for tasks which are assigned but expired. 
If any are found, they
+                        are moved back to <code>TASK_UNASSIGNED</code> state 
again so that they can
+                        be retried. It is possible that these tasks are 
assigned to slow workers, or
+                        they may already be finished. This is not a problem, 
because log splitting
+                        tasks have the property of idempotence. In other 
words, the same log
+                        splitting task can be processed many times without 
causing any
+                        problem.</para>
+                    </listitem>
+                    <listitem>
+                      <para>The split log manager watches the HBase split log 
znodes constantly. If
+                        any split log task node data is changed, the split log 
manager retrieves the
+                        node data. The
+                        node data contains the current state of the task. You 
can use the
+                        <command>zkcli</command> <command>get</command> 
command to retrieve the
+                        current state of a task. In the example output below, 
the first line of the
+                        output shows that the task is currently 
unassigned.</para>
+                      <screen>
+<userinput>get 
/hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
+</userinput> 
+<computeroutput>unassigned host2.sample.com:57000
+cZxid = 0x7115
+ctime = Sat Jun 23 11:13:40 PDT 2012
+...</computeroutput>  
+                      </screen>
+                      <para>Based on the state of the task whose data is 
changed, the split log
+                        manager does one of the following:</para>
+
+                      <itemizedlist>
+                        <listitem>
+                          <para>Resubmit the task if it is unassigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Heartbeat the task if it is assigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is resigned 
(see <xref
+                            linkend="distributed.log.replay.failure.reasons" 
/>)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is completed 
with errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" 
/>)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it could not 
complete due to
+                            errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" 
/>)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Delete the task if it is successfully 
completed or failed</para>
+                        </listitem>
+                      </itemizedlist>
+                      <itemizedlist 
xml:id="distributed.log.replay.failure.reasons">
+                        <title>Reasons a Task Will Fail</title>
+                        <listitem><para>The task has been 
deleted.</para></listitem>
+                        <listitem><para>The node no longer 
exists.</para></listitem>
+                        <listitem><para>The log status manager failed to move 
the state of the task
+                          to TASK_UNASSIGNED.</para></listitem>
+                        <listitem><para>The number of resubmits is over the 
resubmit
+                          threshold.</para></listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
+                <step>
+                  <title>Each RegionServer's split log worker performs the 
log-splitting tasks.</title>
+                  <para>Each RegionServer runs a daemon thread called the 
<firstterm>split log
+                      worker</firstterm>, which does the work to split the 
logs. The daemon thread
+                    starts when the RegionServer starts, and registers itself 
to watch HBase znodes.
+                    If any splitlog znode children change, it notifies a 
sleeping worker thread to
+                    wake up and grab more tasks. If a worker's current task's node data is
+                    changed, the worker checks to see if the task has been 
taken by another worker.
+                    If so, the worker thread stops work on the current 
task.</para>
+                  <para>The worker monitors
+                    the splitlog znode constantly. When a new task appears, 
the split log worker
+                    retrieves  the task paths and checks each one until it 
finds an unclaimed task,
+                    which it attempts to claim. If the claim was successful, 
it attempts to perform
+                    the task and updates the task's <property>state</property> 
property based on the
+                    splitting outcome. At this point, the split log worker 
scans for another
+                    unclaimed task.</para>
+                  <itemizedlist>
+                    <title>How the Split Log Worker Approaches a Task</title>
+
+                    <listitem>
+                      <para>It queries the task state and only takes action if 
the task is in
+                          <literal>TASK_UNASSIGNED</literal> state.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the task is in
<literal>TASK_UNASSIGNED</literal> state, the
+                        worker attempts to set the state to 
<literal>TASK_OWNED</literal> by itself.
+                        If it fails to set the state, another worker will try 
to grab it. The split
+                        log manager will also ask all workers to rescan later 
if the task remains
+                        unassigned.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the worker succeeds in taking ownership of the 
task, it tries to get
+                        the task state again to make sure it really gets it 
asynchronously. In the
+                        meantime, it starts a split task executor to do the 
actual work: </para>
+                      <itemizedlist>
+                        <listitem>
+                          <para>Get the HBase root folder, create a temp 
folder under the root, and
+                            split the log file to the temp folder.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the split was successful, the task executor 
sets the task to
+                            state <literal>TASK_DONE</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker catches an unexpected 
IOException, the task is set to
+                            state <literal>TASK_ERR</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker is shutting down, the task is set to state
+                              <literal>TASK_RESIGNED</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the task is taken by another worker, just 
log it.</para>
+                        </listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
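The claim step above hinges on an atomic compare-and-set of the task state, so only one worker can win a task. A toy model of that check (illustrative only; real workers use versioned ZooKeeper znode writes, not an in-memory map):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy model of how a split log worker claims a task (illustrative only). */
public class SplitLogWorkerModel {
    enum State { TASK_UNASSIGNED, TASK_OWNED, TASK_DONE, TASK_ERR, TASK_RESIGNED }

    static final ConcurrentMap<String, State> tasks = new ConcurrentHashMap<>();

    /**
     * Atomically move a task from TASK_UNASSIGNED to TASK_OWNED.
     * Only one worker can succeed; a concurrent claimer sees the state
     * already changed and moves on to the next unclaimed task.
     */
    static boolean tryClaim(String task) {
        return tasks.replace(task, State.TASK_UNASSIGNED, State.TASK_OWNED);
    }

    public static void main(String[] args) {
        tasks.put("wal-task-1", State.TASK_UNASSIGNED);
        System.out.println(tryClaim("wal-task-1")); // true: first worker wins
        System.out.println(tryClaim("wal-task-1")); // false: already owned
    }
}
```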
+                </step>
+                <step>
+                  <title>The split log manager monitors for uncompleted 
tasks.</title>
+                  <para>The split log manager returns when all tasks are 
completed successfully. If
+                    all tasks are completed with some failures, the split log 
manager throws an
+                    exception so that the log splitting can be retried. Due to 
an asynchronous
+                    implementation, in very rare cases, the split log manager 
loses track of some
+                    completed tasks. For that reason, it periodically checks 
for remaining
+                    uncompleted tasks in its task map or ZooKeeper. If none are
found, it throws an
+                    exception so that the log splitting can be retried right 
away instead of hanging
+                    there waiting for something that won’t happen.</para>
+                </step>
+              </procedure>
+            </section>
+            <section xml:id="distributed.log.replay">
+              <title>Distributed Log Replay</title>
+              <para>After a RegionServer fails, its regions are assigned to other
+                RegionServers, and each is marked as "recovering" in ZooKeeper. A split log worker
+                directly replays edits from the WAL of the failed RegionServer to each region at its
+                new location. When a region is in "recovering" state, it can accept writes but no
+                reads (including Append and Increment), region splits, or merges.</para>
+              <para>Distributed Log Replay extends the <xref 
linkend="distributed.log.splitting" /> framework. It works by
+                directly replaying WAL edits to another RegionServer instead 
of creating
+                  <filename>recovered.edits</filename> files. It provides the 
following advantages
+                over distributed log splitting alone:</para>
+              <itemizedlist>
+                <listitem><para>It eliminates the overhead of writing and 
reading a large number of
+                  <filename>recovered.edits</filename> files. It is not 
unusual for thousands of
+                  <filename>recovered.edits</filename> files to be created and 
written concurrently
+                  during a RegionServer recovery. Many small random writes can 
degrade overall
+                  system performance.</para></listitem>
+                <listitem><para>It allows writes even when a region is in 
recovering state. It only takes seconds for a recovering region to accept 
writes again. 
+</para></listitem>
+              </itemizedlist>
+              <formalpara>
+                <title>Enabling Distributed Log Replay</title>
+                <para>To enable distributed log replay, set 
<varname>hbase.master.distributed.log.replay</varname> to
+                  true. This will be the default for HBase 0.99 (<link
+                    
xlink:href="https://issues.apache.org/jira/browse/HBASE-10888">HBASE-10888</link>).</para>
+              </formalpara>
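For example, a minimal <filename>hbase-site.xml</filename> fragment (the surrounding configuration is elided):

```xml
<!-- hbase-site.xml: enable distributed log replay -->
<property>
  <name>hbase.master.distributed.log.replay</name>
  <value>true</value>
</property>
```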
+              <para>You must also enable HFile version 3 (which is the default HFile format starting
+                in HBase 0.99). See <link
+                  
xlink:href="https://issues.apache.org/jira/browse/HBASE-10855";>HBASE-1085

<TRUNCATED>
