http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/architecture.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/architecture.xml b/src/main/docbkx/architecture.xml new file mode 100644 index 0000000..16b298a --- /dev/null +++ b/src/main/docbkx/architecture.xml @@ -0,0 +1,3489 @@ +<?xml version="1.0" encoding="UTF-8"?> +<chapter + xml:id="architecture" + version="5.0" + xmlns="http://docbook.org/ns/docbook" + xmlns:xlink="http://www.w3.org/1999/xlink" + xmlns:xi="http://www.w3.org/2001/XInclude" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:m="http://www.w3.org/1998/Math/MathML" + xmlns:html="http://www.w3.org/1999/xhtml" + xmlns:db="http://docbook.org/ns/docbook"> + <!--/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +--> + + <title>Architecture</title> + <section xml:id="arch.overview"> + <title>Overview</title> + <section xml:id="arch.overview.nosql"> + <title>NoSQL?</title> + <para>HBase is a type of "NoSQL" database. 
"NoSQL" is a general term meaning that the database isn't an RDBMS which + supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an + example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, + HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, + such as typed columns, secondary indexes, triggers, and advanced query languages. + </para> + <para>However, HBase has many features which support both linear and modular scaling. HBase clusters expand + by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 + RegionServers, for example, it doubles in both storage and processing capacity. + An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best + performance requires specialized hardware and storage devices. HBase features of note are: + <itemizedlist> + <listitem><para>Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore.
This + makes it very suitable for tasks such as high-speed counter aggregation.</para> </listitem> + <listitem><para>Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are + automatically split and re-distributed as your data grows.</para></listitem> + <listitem><para>Automatic RegionServer failover</para></listitem> + <listitem><para>Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.</para></listitem> + <listitem><para>MapReduce: HBase supports massively parallelized processing via MapReduce, with HBase as both + source and sink.</para></listitem> + <listitem><para>Java Client API: HBase supports an easy-to-use Java API for programmatic access.</para></listitem> + <listitem><para>Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.</para></listitem> + <listitem><para>Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume query optimization.</para></listitem> + <listitem><para>Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.</para></listitem> + </itemizedlist> + </para> + </section> + + <section xml:id="arch.overview.when"> + <title>When Should I Use HBase?</title> + <para>HBase isn't suitable for every problem.</para> + <para>First, make sure you have enough data. If you have hundreds of millions or billions of rows, then + HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS + might be a better choice, because all of your data might wind up on a single node (or two) and + the rest of the cluster may be sitting idle. + </para> + <para>Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, + secondary indexes, transactions, advanced query languages, etc.).
An application built against an RDBMS cannot be + "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a + complete redesign as opposed to a port. + </para> + <para>Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than + 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode. + </para> + <para>HBase can run quite well stand-alone on a laptop - but this should be considered a development + configuration only. + </para> + </section> + <section xml:id="arch.overview.hbasehdfs"> + <title>What Is The Difference Between HBase and Hadoop/HDFS?</title> + <para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files. + Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. + HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. + This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist + on HDFS for high-speed lookups. See the <xref linkend="datamodel" /> and the rest of this chapter for more information on how HBase achieves its goals. + </para> + </section> + </section> + + <section + xml:id="arch.catalog"> + <title>Catalog Tables</title> + <para>The catalog table <code>hbase:meta</code> exists as an HBase table and is filtered out of the HBase + shell's <code>list</code> command, but is in fact a table just like any other. </para> + <section + xml:id="arch.catalog.root"> + <title>-ROOT-</title> + <note> + <para>The <code>-ROOT-</code> table was removed in HBase 0.96.0. 
Information here should + be considered historical.</para> + </note> + <para>The <code>-ROOT-</code> table kept track of the location of the + <code>.META.</code> table (the previous name for the table now called <code>hbase:meta</code>) prior to HBase + 0.96. The <code>-ROOT-</code> table structure was as follows: </para> + <itemizedlist> + <title>Key</title> + <listitem> + <para>.META. region key (<code>.META.,,1</code>)</para> + </listitem> + </itemizedlist> + + <itemizedlist> + <title>Values</title> + <listitem> + <para><code>info:regioninfo</code> (serialized <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link> + instance of hbase:meta)</para> + </listitem> + <listitem> + <para><code>info:server</code> (server:port of the RegionServer holding + hbase:meta)</para> + </listitem> + <listitem> + <para><code>info:serverstartcode</code> (start-time of the RegionServer process holding + hbase:meta)</para> + </listitem> + </itemizedlist> + </section> + <section + xml:id="arch.catalog.meta"> + <title>hbase:meta</title> + <para>The <code>hbase:meta</code> table (previously called <code>.META.</code>) keeps a list + of all regions in the system.
The location of <code>hbase:meta</code> was previously + tracked within the <code>-ROOT-</code> table, but is now stored in ZooKeeper.</para> + <para>The <code>hbase:meta</code> table structure is as follows: </para> + <itemizedlist> + <title>Key</title> + <listitem> + <para>Region key of the format (<code>[table],[region start key],[region + id]</code>)</para> + </listitem> + </itemizedlist> + <itemizedlist> + <title>Values</title> + <listitem> + <para><code>info:regioninfo</code> (serialized <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html"> + HRegionInfo</link> instance for this region)</para> + </listitem> + <listitem> + <para><code>info:server</code> (server:port of the RegionServer containing this + region)</para> + </listitem> + <listitem> + <para><code>info:serverstartcode</code> (start-time of the RegionServer process + containing this region)</para> + </listitem> + </itemizedlist> + <para>When a table is in the process of splitting, two other columns will be created, called + <code>info:splitA</code> and <code>info:splitB</code>. These columns represent the two + daughter regions. The values for these columns are also serialized HRegionInfo instances. + After the region has been split, this row will eventually be deleted. </para> + <note> + <title>Note on HRegionInfo</title> + <para>The empty key is used to denote table start and table end. A region with an empty + start key is the first region in a table. If a region has both an empty start and an + empty end key, it is the only region in the table. </para> + </note> + <para>In the (hopefully unlikely) event that programmatic processing of catalog metadata is + required, see the <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</link> + utility.
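For illustration, the textual form of a <code>hbase:meta</code> row key, <code>[table],[region start key],[region id]</code>, can be assembled like so (a sketch with hypothetical values; actual keys are raw bytes, and the region id is typically the region creation timestamp):
<programlisting language="java">
String table = "TestTable";     // hypothetical table name
String startKey = "row-0500";   // empty for the first region of a table
long regionId = 1377158156689L; // hypothetical region id
String metaRowKey = table + "," + startKey + "," + regionId;
// metaRowKey is now "TestTable,row-0500,1377158156689"
</programlisting>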
</para> + </section> + <section + xml:id="arch.catalog.startup"> + <title>Startup Sequencing</title> + <para>First, the location of <code>hbase:meta</code> is looked up in ZooKeeper. Next, + <code>hbase:meta</code> is updated with server and startcode values.</para> + <para>For information on region-RegionServer assignment, see <xref + linkend="regions.arch.assignment" />. </para> + </section> + </section> <!-- catalog --> + + <section + xml:id="client"> + <title>Client</title> + <para>The HBase client finds the RegionServers that are serving the particular row range of + interest. It does this by querying the <code>hbase:meta</code> table. See <xref + linkend="arch.catalog.meta" /> for details. After locating the required region(s), the + client contacts the RegionServer serving that region, rather than going through the master, + and issues the read or write request. This information is cached in the client so that + subsequent requests need not go through the lookup process. Should a region be reassigned + either by the master load balancer or because a RegionServer has died, the client will + requery the catalog tables to determine the new location of the user region. </para> + + <para>See <xref + linkend="master.runtime" /> for more information about the impact of the Master on HBase + Client communication. </para> + <para>Administrative functions are done via an instance of <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html">Admin</link>. + </para> + + <section + xml:id="client.connections"> + <title>Cluster Connections</title> + <para>The API changed in HBase 1.0. It has been cleaned up, and users are returned + Interfaces to work against rather than particular types. In HBase 1.0, + obtain a cluster Connection from ConnectionFactory and thereafter, get from it + instances of Table, Admin, and RegionLocator on an as-needed basis. When done, close + obtained instances.
Finally, be sure to clean up your Connection instance before + exiting. Connections are heavyweight objects. Create once and keep an instance around. + Table, Admin and RegionLocator instances are lightweight. Create them as you go, and + let them go as soon as you are done by closing them. See the + <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html">Client Package Javadoc Description</link> for example usage of the new HBase 1.0 API.</para> + + <para>For connection configuration information, see <xref linkend="client_dependencies" />. </para> + + <para><emphasis><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link> + instances are not thread-safe</emphasis>. Only one thread can use an instance of Table at + any given time. When creating Table instances, it is advisable to use the same <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link> + instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers, + which is usually what you want. For example, this is preferred:</para> + <programlisting language="java">Configuration conf = HBaseConfiguration.create(); +HTable table1 = new HTable(conf, "myTable"); +HTable table2 = new HTable(conf, "myTable");</programlisting> + <para>as opposed to this:</para> + <programlisting language="java">Configuration conf1 = HBaseConfiguration.create(); +HTable table1 = new HTable(conf1, "myTable"); +Configuration conf2 = HBaseConfiguration.create(); +HTable table2 = new HTable(conf2, "myTable");</programlisting> + + <para>For more information about how connections are handled in the HBase client, + see <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html">HConnectionManager</link>.
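The HBase 1.0 pattern described above can be sketched as follows (a sketch, not a complete program: it assumes a running cluster and a hypothetical table named "myTable"):
<programlisting language="java">
// Heavyweight: create one Connection per application and reuse it.
Configuration conf = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(conf);
// Lightweight: create Table instances as needed, close them when done.
Table table = connection.getTable(TableName.valueOf("myTable"));
// ... use the table ...
table.close();
// Close the Connection on application exit.
connection.close();
</programlisting>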
</para> + <section xml:id="client.connection.pooling"><title>Connection Pooling</title> + <para>For applications that require high-end multithreaded access (e.g., web servers or application servers that may serve many application threads + in a single JVM), you can pre-create an <classname>HConnection</classname>, as shown in + the following example:</para> + <example> + <title>Pre-Creating an <code>HConnection</code></title> + <programlisting language="java">// Create a connection to the cluster. +Configuration conf = HBaseConfiguration.create(); +HConnection connection = HConnectionManager.createConnection(conf); +HTableInterface table = connection.getTable("myTable"); +// use table as needed, the table returned is lightweight +table.close(); +// use the connection for other access to the cluster +connection.close();</programlisting> + </example> + <para>Constructing an HTableInterface implementation is very lightweight, and resources are + controlled.</para> + <warning> + <title><code>HTablePool</code> is Deprecated</title> + <para>Previous versions of this guide discussed <code>HTablePool</code>, which was + deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by <link + xlink:href="https://issues.apache.org/jira/browse/HBASE-6580">HBASE-6580</link>. + Please use <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html"><code>HConnection</code></link> instead.</para> + </warning> + </section> + </section> + <section xml:id="client.writebuffer"><title>WriteBuffer and Batch Methods</title> + <para>If <xref linkend="perf.hbase.client.autoflush" /> is turned off on + <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>, + <classname>Put</classname>s are sent to RegionServers when the writebuffer + is filled. The writebuffer is 2MB by default.
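A buffered-write sketch (assumes a running cluster; the table, family, and qualifier names are hypothetical):
<programlisting language="java">
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "myTable");
table.setAutoFlush(false); // buffer Puts client-side
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
table.put(put);        // queued in the writebuffer, not yet sent
table.flushCommits();  // sends buffered Puts to the RegionServers
table.close();         // close() also flushes any remaining buffered Puts
</programlisting>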
Before an HTable instance is + discarded, either <methodname>close()</methodname> or + <methodname>flushCommits()</methodname> should be invoked so Puts + will not be lost. + </para> + <para>Note: <code>htable.delete(Delete);</code> does not go in the writebuffer! This only applies to Puts. + </para> + <para>For additional information on write durability, review the <link xlink:href="../acid-semantics.html">ACID semantics</link> page. + </para> + <para>For fine-grained control of batching of + <classname>Put</classname>s or <classname>Delete</classname>s, + see the <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29">batch</link> methods on HTable. + </para> + </section> + <section xml:id="client.external"><title>External Clients</title> + <para>Information on non-Java clients and custom protocols is covered in <xref linkend="external_apis" /> + </para> + </section> + </section> + + <section xml:id="client.filter"><title>Client Request Filters</title> + <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link> and <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> instances can be + optionally configured with <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html">filters</link> which are applied on the RegionServer. + </para> + <para>Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups + of Filter functionality. 
+ </para> + <section xml:id="client.filter.structural"><title>Structural</title> + <para>Structural Filters contain other Filters.</para> + <section xml:id="client.filter.structural.fl"><title>FilterList</title> + <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html">FilterList</link> + represents a list of Filters with a relationship of <code>FilterList.Operator.MUST_PASS_ALL</code> or + <code>FilterList.Operator.MUST_PASS_ONE</code> between the Filters. The following example shows an 'or' between two + Filters (checking for either 'my value' or 'my other value' on the same attribute).</para> +<programlisting language="java"> +FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE); +SingleColumnValueFilter filter1 = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + Bytes.toBytes("my value") + ); +list.add(filter1); +SingleColumnValueFilter filter2 = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + Bytes.toBytes("my other value") + ); +list.add(filter2); +scan.setFilter(list); +</programlisting> + </section> + </section> + <section + xml:id="client.filter.cv"> + <title>Column Value</title> + <section + xml:id="client.filter.cv.scvf"> + <title>SingleColumnValueFilter</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html">SingleColumnValueFilter</link> + can be used to test column values for equivalence (<code><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html">CompareOp.EQUAL</link> + </code>), inequality (<code>CompareOp.NOT_EQUAL</code>), or ranges (e.g., + <code>CompareOp.GREATER</code>). 
The following is an example of testing a column for equivalence to the String value "my value":</para> + <programlisting language="java"> +SingleColumnValueFilter filter = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + Bytes.toBytes("my value") + ); +scan.setFilter(filter); +</programlisting> + </section> + </section> + <section + xml:id="client.filter.cvp"> + <title>Column Value Comparators</title> + <para>There are several Comparator classes in the Filter package that deserve special + mention. These Comparators are used in concert with other Filters, such as <xref + linkend="client.filter.cv.scvf" />. </para> + <section + xml:id="client.filter.cvp.rcs"> + <title>RegexStringComparator</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html">RegexStringComparator</link> + supports regular expressions for value comparisons.</para> + <programlisting language="java"> +RegexStringComparator comp = new RegexStringComparator("my."); // any value that starts with 'my' +SingleColumnValueFilter filter = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + comp + ); +scan.setFilter(filter); +</programlisting> + <para>See the Oracle JavaDoc for <link + xlink:href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">supported + RegEx patterns in Java</link>. </para> + </section> + <section + xml:id="client.filter.cvp.SubStringComparator"> + <title>SubstringComparator</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html">SubstringComparator</link> + can be used to determine if a given substring exists in a value. The comparison is + case-insensitive.
</para> + <programlisting language="java"> +SubstringComparator comp = new SubstringComparator("y val"); // looking for 'my value' +SingleColumnValueFilter filter = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + comp + ); +scan.setFilter(filter); +</programlisting> + </section> + <section + xml:id="client.filter.cvp.bfp"> + <title>BinaryPrefixComparator</title> + <para>See <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.html">BinaryPrefixComparator</link>.</para> + </section> + <section + xml:id="client.filter.cvp.bc"> + <title>BinaryComparator</title> + <para>See <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComparator.html">BinaryComparator</link>.</para> + </section> + </section> + <section + xml:id="client.filter.kvm"> + <title>KeyValue Metadata</title> + <para>As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate + the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to + the values of the previous section. </para> + <section + xml:id="client.filter.kvm.ff"> + <title>FamilyFilter</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html">FamilyFilter</link> + can be used to filter on the ColumnFamily. It is generally a better idea to select + ColumnFamilies in the Scan than to do it with a Filter.</para> + </section> + <section + xml:id="client.filter.kvm.qf"> + <title>QualifierFilter</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html">QualifierFilter</link> + can be used to filter based on Column (aka Qualifier) name.
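A sketch of its use on a Scan (the qualifier name is hypothetical):
<programlisting language="java">
Scan scan = new Scan();
Filter f = new QualifierFilter(CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("myQualifier")));
scan.setFilter(f);
</programlisting>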
</para> + </section> + <section + xml:id="client.filter.kvm.cpf"> + <title>ColumnPrefixFilter</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html">ColumnPrefixFilter</link> + can be used to filter based on the lead portion of Column (aka Qualifier) names. </para> + <para>A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row + and for each involved column family. It can be used to efficiently get a subset of the + columns in very wide rows. </para> + <para>Note: The same column qualifier can be used in different column families. This + filter returns all matching columns. </para> + <para>Example: Find all columns in a row and family that start with "abc"</para> + <programlisting language="java"> +HTableInterface t = ...; +byte[] row = ...; +byte[] family = ...; +byte[] prefix = Bytes.toBytes("abc"); +Scan scan = new Scan(row, row); // (optional) limit to one row +scan.addFamily(family); // (optional) limit to one family +Filter f = new ColumnPrefixFilter(prefix); +scan.setFilter(f); +scan.setBatch(10); // set this if there could be many columns returned +ResultScanner rs = t.getScanner(scan); +for (Result r = rs.next(); r != null; r = rs.next()) { + for (KeyValue kv : r.raw()) { + // each kv represents a column + } +} +rs.close(); +</programlisting> + </section> + <section + xml:id="client.filter.kvm.mcpf"> + <title>MultipleColumnPrefixFilter</title> + <para><link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html">MultipleColumnPrefixFilter</link> + behaves like ColumnPrefixFilter but allows specifying multiple prefixes. </para> + <para>Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the + first column matching the lowest prefix and also seeks past ranges of columns between + prefixes. It can be used to efficiently get discontinuous sets of columns from very wide + rows. 
</para> + <para>Example: Find all columns in a row and family that start with "abc" or "xyz"</para> + <programlisting language="java"> +HTableInterface t = ...; +byte[] row = ...; +byte[] family = ...; +byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")}; +Scan scan = new Scan(row, row); // (optional) limit to one row +scan.addFamily(family); // (optional) limit to one family +Filter f = new MultipleColumnPrefixFilter(prefixes); +scan.setFilter(f); +scan.setBatch(10); // set this if there could be many columns returned +ResultScanner rs = t.getScanner(scan); +for (Result r = rs.next(); r != null; r = rs.next()) { + for (KeyValue kv : r.raw()) { + // each kv represents a column + } +} +rs.close(); +</programlisting> + </section> + <section + xml:id="client.filter.kvm.crf"> + <title>ColumnRangeFilter</title> + <para>A <link + xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html">ColumnRangeFilter</link> + allows efficient intra-row scanning. </para> + <para>A ColumnRangeFilter can seek ahead to the first matching column for each involved + column family. It can be used to efficiently get a 'slice' of the columns of a very wide + row. For example, you might have a million columns in a row but only want to look at columns + bbbb-bbdd. </para> + <para>Note: The same column qualifier can be used in different column families. This + filter returns all matching columns.
</para> + <para>Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" + (inclusive)</para> + <programlisting language="java"> +HTableInterface t = ...; +byte[] row = ...; +byte[] family = ...; +byte[] startColumn = Bytes.toBytes("bbbb"); +byte[] endColumn = Bytes.toBytes("bbdd"); +Scan scan = new Scan(row, row); // (optional) limit to one row +scan.addFamily(family); // (optional) limit to one family +Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true); +scan.setFilter(f); +scan.setBatch(10); // set this if there could be many columns returned +ResultScanner rs = t.getScanner(scan); +for (Result r = rs.next(); r != null; r = rs.next()) { + for (KeyValue kv : r.raw()) { + // each kv represents a column + } +} +rs.close(); +</programlisting> + <para>Note: Introduced in HBase 0.92.</para> + </section> + </section> + <section xml:id="client.filter.row"><title>RowKey</title> + <section xml:id="client.filter.row.rf"><title>RowFilter</title> + <para>It is generally a better idea to use the startRow/stopRow methods on Scan for row selection; however, + <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RowFilter.html">RowFilter</link> can also be used.</para> + </section> + </section> + <section xml:id="client.filter.utility"><title>Utility</title> + <section xml:id="client.filter.utility.fkof"><title>FirstKeyOnlyFilter</title> + <para>This is primarily used for rowcount jobs. + See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html">FirstKeyOnlyFilter</link>.</para> + </section> + </section> + </section> <!-- client.filter --> + + <section xml:id="master"><title>Master</title> + <para><code>HMaster</code> is the implementation of the Master Server. The Master server is + responsible for monitoring all RegionServer instances in the cluster, and is the interface + for all metadata changes.
In a distributed cluster, the Master typically runs on the <xref + linkend="arch.hdfs.nn"/>. J Mohamed Zahoor goes into some more detail on the Master + Architecture in this blog posting, <link + xlink:href="http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/">HBase HMaster + Architecture </link>.</para> + <section xml:id="master.startup"><title>Startup Behavior</title> + <para>If run in a multi-Master environment, all Masters compete to run the cluster. If the active + Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to + take over the Master role. + </para> + </section> + <section + xml:id="master.runtime"> + <title>Runtime Impact</title> + <para>A common dist-list question involves what happens to an HBase cluster when the Master + goes down. Because the HBase client talks directly to the RegionServers, the cluster can + still function in a "steady state." Additionally, per <xref + linkend="arch.catalog" />, <code>hbase:meta</code> exists as an HBase table and is not + resident in the Master. However, the Master controls critical functions such as + RegionServer failover and completing region splits. So while the cluster can still run for + a short time without the Master, the Master should be restarted as soon as possible. + </para> + </section> + <section xml:id="master.api"><title>Interface</title> + <para>The methods exposed by <code>HMasterInterface</code> are primarily metadata-oriented methods: + <itemizedlist> + <listitem><para>Table (createTable, modifyTable, removeTable, enable, disable) + </para></listitem> + <listitem><para>ColumnFamily (addColumn, modifyColumn, removeColumn) + </para></listitem> + <listitem><para>Region (move, assign, unassign) + </para></listitem> + </itemizedlist> + For example, when the <code>HBaseAdmin</code> method <code>disableTable</code> is invoked, it is serviced by the Master server.
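For example (a sketch, assuming an existing cluster <code>Connection</code>; the call below is serviced by the Master):
<programlisting language="java">
Admin admin = connection.getAdmin();
admin.disableTable(TableName.valueOf("myTable")); // serviced by the Master
admin.close();
</programlisting>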
</para> + </section> + <section xml:id="master.processes"><title>Processes</title> + <para>The Master runs several background threads: + </para> + <section xml:id="master.processes.loadbalancer"><title>LoadBalancer</title> + <para>Periodically, and when there are no regions in transition, + a load balancer will run and move regions around to balance the cluster's load. + See <xref linkend="balancer_config" /> for configuring this property.</para> + <para>See <xref linkend="regions.arch.assignment"/> for more information on region assignment. + </para> + </section> + <section xml:id="master.processes.catalog"><title>CatalogJanitor</title> + <para>Periodically checks and cleans up the hbase:meta table. See <xref linkend="arch.catalog.meta" /> for more information on META.</para> + </section> + </section> + + </section> + <section + xml:id="regionserver.arch"> + <title>RegionServer</title> + <para><code>HRegionServer</code> is the RegionServer implementation. It is responsible for + serving and managing regions. In a distributed cluster, a RegionServer runs on a <xref + linkend="arch.hdfs.dn" />. </para> + <section + xml:id="regionserver.arch.api"> + <title>Interface</title> + <para>The methods exposed by <code>HRegionInterface</code> contain both data-oriented + and region-maintenance methods: <itemizedlist> + <listitem> + <para>Data (get, put, delete, next, etc.)</para> + </listitem> + <listitem> + <para>Region (splitRegion, compactRegion, etc.)</para> + </listitem> + </itemizedlist> For example, when the <code>HBaseAdmin</code> method + <code>majorCompact</code> is invoked on a table, the client is actually iterating + through all regions for the specified table and requesting a major compaction directly on + each region.
</para> + </section> + <section + xml:id="regionserver.arch.processes"> + <title>Processes</title> + <para>The RegionServer runs a variety of background threads:</para> + <section + xml:id="regionserver.arch.processes.compactsplit"> + <title>CompactSplitThread</title> + <para>Checks for splits and handles minor compactions.</para> + </section> + <section + xml:id="regionserver.arch.processes.majorcompact"> + <title>MajorCompactionChecker</title> + <para>Checks for major compactions.</para> + </section> + <section + xml:id="regionserver.arch.processes.memstore"> + <title>MemStoreFlusher</title> + <para>Periodically flushes in-memory writes in the MemStore to StoreFiles.</para> + </section> + <section + xml:id="regionserver.arch.processes.log"> + <title>LogRoller</title> + <para>Periodically checks the RegionServer's WAL.</para> + </section> + </section> + + <section + xml:id="coprocessors"> + <title>Coprocessors</title> + <para>Coprocessors were added in 0.92. There is a thorough <link + xlink:href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">Blog Overview + of CoProcessors</link> posted. Documentation will eventually move to this reference + guide, but the blog is the most current information available at this time. </para> + </section> + + <section + xml:id="block.cache"> + <title>Block Cache</title> + + <para>HBase provides two different BlockCache implementations: the default onheap + LruBlockCache and BucketCache, which is (usually) offheap. This section + discusses benefits and drawbacks of each implementation, how to choose the appropriate + option, and configuration options for each.</para> + + <note><title>Block Cache Reporting: UI</title> + <para>See the RegionServer UI for details on the caching deploy.
Since HBase-0.98.4, the
+ Block Cache detail has been significantly extended, showing configurations,
+ sizings, current usage, time-in-the-cache, and even detail on block counts and types.</para>
+ </note>
+
+ <section>
+
+ <title>Cache Choices</title>
+ <para><classname>LruBlockCache</classname> is the original implementation, and is
+ entirely within the Java heap. <classname>BucketCache</classname> is mainly
+ intended for keeping blockcache data offheap, although BucketCache can also
+ keep data onheap and serve from a file-backed cache.
+ <note><title>BucketCache is production ready as of hbase-0.98.6</title>
+ <para>To run with BucketCache, you need HBASE-11678. This was included in
+ hbase-0.98.6.
+ </para>
+ </note>
+ </para>
+
+ <para>Fetching from BucketCache will always be slower than fetching from the
+ native onheap LruBlockCache. However, latencies tend to be
+ less erratic across time, because there is less garbage collection when you use
+ BucketCache: it manages its BlockCache allocations itself rather than leaving them to
+ the GC. If the BucketCache is deployed in offheap mode, this memory is not managed by the
+ GC at all. This is why you would use BucketCache: to keep your latencies less erratic
+ and to mitigate GC pauses and heap fragmentation. See Nick Dimiduk's <link
+ xlink:href="http://www.n10k.com/blog/blockcache-101/">BlockCache 101</link> for
+ comparisons running onheap vs offheap tests. Also see
+ <link xlink:href="http://people.apache.org/~stack/bc/">Comparing BlockCache Deploys</link>,
+ which finds that if your dataset fits inside your LruBlockCache deploy, use it; otherwise,
+ if you are experiencing cache churn (or you want your cache to exist beyond the
+ vagaries of Java GC), use BucketCache.
+ </para>
+
+ <para>When you enable BucketCache, you are enabling a two-tier caching
+ system, an L1 cache which is implemented by an instance of LruBlockCache and
+ an offheap L2 cache which is implemented by BucketCache.
Management of these
+ two tiers and the policy that dictates how blocks move between them is done by
+ <classname>CombinedBlockCache</classname>. It keeps all DATA blocks in the L2
+ BucketCache and meta blocks -- INDEX and BLOOM blocks --
+ onheap in the L1 <classname>LruBlockCache</classname>.
+ See <xref linkend="offheap.blockcache" /> for more detail on going offheap.</para>
+ </section>
+
+ <section xml:id="cache.configurations">
+ <title>General Cache Configurations</title>
+ <para>Apart from the cache implementation itself, you can set some general configuration
+ options to control how the cache performs. See <link
+ xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html"
+ />. After setting any of these options, restart or rolling restart your cluster for the
+ configuration to take effect. Check logs for errors or unexpected behavior.</para>
+ <para>See also <xref linkend="blockcache.prefetch"/>, which discusses a new option
+ introduced in <link xlink:href="https://issues.apache.org/jira/browse/HBASE-9857"
+ >HBASE-9857</link>.</para>
+ </section>
+
+ <section
+ xml:id="block.cache.design">
+ <title>LruBlockCache Design</title>
+ <para>The LruBlockCache is an LRU cache that contains three levels of block priority to
+ allow for scan-resistance and in-memory ColumnFamilies: </para>
+ <itemizedlist>
+ <listitem>
+ <para>Single access priority: The first time a block is loaded from HDFS it normally
+ has this priority and it will be part of the first group to be considered during
+ evictions. The advantage is that scanned blocks are more likely to get evicted than
+ blocks that are getting more usage.</para>
+ </listitem>
+ <listitem>
+ <para>Multi access priority: If a block in the previous priority group is accessed
+ again, it upgrades to this priority.
It is thus part of the second group considered
+ during evictions.</para>
+ </listitem>
+ <listitem xml:id="hbase.cache.inmemory">
+ <para>In-memory access priority: If the block's family was configured to be
+ "in-memory", it will be part of this priority regardless of the number of times it
+ was accessed. Catalog tables are configured like this. This group is the last one
+ considered during evictions.</para>
+ <para>To mark a column family as in-memory, call
+ <programlisting language="java">HColumnDescriptor.setInMemory(true);</programlisting> if creating a table from Java,
+ or set <command>IN_MEMORY => true</command> when creating or altering a table in
+ the shell: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 'f', IN_MEMORY => 'true'}</programlisting></para>
+ </listitem>
+ </itemizedlist>
+ <para> For more information, see the <link
+ xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html">LruBlockCache
+ source</link>.
+ </para>
+ </section>
+ <section
+ xml:id="block.cache.usage">
+ <title>LruBlockCache Usage</title>
+ <para>Block caching is enabled by default for all the user tables, which means that any
+ read operation will load the LRU cache. This might be good for a large number of use
+ cases, but further tuning is usually required to achieve better performance.
+ An important concept is the <link
+ xlink:href="http://en.wikipedia.org/wiki/Working_set_size">working set size</link>, or
+ WSS, which is: "the amount of memory needed to compute the answer to a problem". For a
+ website, this would be the data that's needed to answer the queries over a short amount
+ of time. </para>
+ <para>The way to calculate how much memory is available in HBase for caching is: </para>
+ <programlisting>
+ number of region servers * heap size * hfile.block.cache.size * 0.99
+ </programlisting>
+ <para>The default value for the block cache is 0.25, which represents 25% of the available
+ heap.
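+ As a quick sanity check, the formula can be evaluated directly (an illustrative
+ sketch only; the class below is not part of the HBase API):
+ <programlisting language="java">public class BlockCacheMath {
+     // capacity = servers * heap (MB) * hfile.block.cache.size * 0.99 (the LRU load factor)
+     static double cacheMB(int servers, double heapMB, double cacheFraction) {
+         return servers * heapMB * cacheFraction * 0.99;
+     }
+     public static void main(String[] args) {
+         System.out.println(cacheMB(1, 1024, 0.25));             // one server, 1 GB heap: ~253 MB
+         System.out.println(cacheMB(20, 8 * 1024, 0.25) / 1024); // 20 servers, 8 GB heaps: ~39.6 GB
+     }
+ }</programlisting>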
The last value (99%) is the default acceptable loading factor in the LRU cache,
+ after which eviction is started. It is included in this equation because it
+ would be unrealistic to use 100% of the available memory, since that would make
+ the process block from the point where it loads new blocks.
+ Here are some examples: </para>
+ <itemizedlist>
+ <listitem>
+ <para>One region server with the default heap size (1 GB) and the default block cache
+ size will have 253 MB of block cache available.</para>
+ </listitem>
+ <listitem>
+ <para>20 region servers with the heap size set to 8 GB and a default block cache size
+ will have 39.6 GB of block cache.</para>
+ </listitem>
+ <listitem>
+ <para>100 region servers with the heap size set to 24 GB and a block cache size of 0.5
+ will have about 1.16 TB of block cache.</para>
+ </listitem>
+ </itemizedlist>
+ <para>Your data is not the only resident of the block cache. Here are others that you may have to take into account:
+ </para>
+ <variablelist>
+ <varlistentry>
+ <term>Catalog Tables</term>
+ <listitem>
+ <para>The <code>-ROOT-</code> (prior to HBase 0.96. See <xref
+ linkend="arch.catalog.root" />) and <code>hbase:meta</code> tables are forced
+ into the block cache and have the in-memory priority, which means that they are
+ harder to evict. The former never uses more than a few hundred bytes while the
+ latter can occupy a few MBs (depending on the number of regions).</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>HFile Indexes</term>
+ <listitem>
+ <para>An <firstterm>hfile</firstterm> is the file format that HBase uses to store
+ data in HDFS. It contains a multi-layered index which allows HBase to seek to the
+ data without having to read the whole file. The size of those indexes is a factor
+ of the block size (64KB by default), the size of your keys, and the amount of data
+ you are storing.
For big data sets it's not unusual to see numbers around 1GB per
+ region server, although not all of it will be in cache because the LRU will evict
+ indexes that aren't used.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Keys</term>
+ <listitem>
+ <para>The values that are stored are only half the picture, since each value is
+ stored along with its key (row key, family, qualifier, and timestamp). See <xref
+ linkend="keysize" />.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Bloom Filters</term>
+ <listitem>
+ <para>Just like the HFile indexes, those data structures (when enabled) are stored
+ in the LRU.</para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ <para>Currently the recommended way to measure HFile index and bloom filter sizes is to
+ look at the region server web UI and check out the relevant metrics. For keys, sampling
+ can be done by using the HFile command line tool and looking for the average key size
+ metric. Since HBase 0.98.3, you can view detail on BlockCache stats and metrics
+ in a special Block Cache section in the UI.</para>
+ <para>It's generally bad to use block caching when the WSS doesn't fit in memory. This is
+ the case when you have, for example, 40GB available across all your region servers' block
+ caches but you need to process 1TB of data. One of the reasons is that the churn
+ generated by the evictions will trigger more garbage collections unnecessarily. Here are
+ two use cases: </para>
+ <itemizedlist>
+ <listitem>
+ <para>Fully random reading pattern: This is a case where you almost never access the
+ same row twice within a short amount of time, such that the chance of hitting a
+ cached block is close to 0. Setting block caching on such a table is a waste of
+ memory and CPU cycles, all the more so because it will generate more garbage for
+ the JVM to collect.
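+ For such a table, the client can also bypass the cache on a per-read basis (a
+ sketch using the standard client Scan API; the <code>table</code> handle is an
+ assumed, already-opened table):
+ <programlisting language="java">Scan scan = new Scan();
+ scan.setCacheBlocks(false); // do not pull this scan's blocks into the BlockCache
+ scan.setCaching(500);       // rows fetched per RPC; unrelated to the BlockCache
+ ResultScanner scanner = table.getScanner(scan);</programlisting>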
For more information on monitoring GC, see <xref
+ linkend="trouble.log.gc" />.</para>
+ </listitem>
+ <listitem>
+ <para>Mapping a table: In a typical MapReduce job that takes a table as input, every
+ row will be read only once, so there's no need to put the blocks into the block cache. The
+ Scan object has the option of turning this off via the setCacheBlocks method (set it to
+ false). You can still keep block caching turned on for this table if you need fast
+ random read access. An example would be counting the number of rows in a table that
+ serves live traffic; caching every block of that table would create massive churn
+ and would surely evict data that's currently in use. </para>
+ </listitem>
+ </itemizedlist>
+ <section xml:id="data.blocks.in.fscache">
+ <title>Caching META blocks only (DATA blocks in fscache)</title>
+ <para>An interesting setup is one where we cache META blocks only and we read DATA
+ blocks in on each access. If the DATA blocks fit inside fscache, this alternative
+ may make sense when access is completely random across a very large dataset.
+ To enable this setup, alter your table and for each column family
+ set <varname>BLOCKCACHE => 'false'</varname>. You are 'disabling' the
+ BlockCache for this column family only; you can never disable the caching of
+ META blocks. Since
+ <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4683">HBASE-4683 Always cache index and bloom blocks</link>,
+ we will cache META blocks even if the BlockCache is disabled.
+ </para>
+ </section>
+ </section>
+ <section
+ xml:id="offheap.blockcache">
+ <title>Offheap Block Cache</title>
+ <section xml:id="enable.bucketcache">
+ <title>How to Enable BucketCache</title>
+ <para>The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache
+ implemented by LruBlockCache and a second L2 cache implemented with BucketCache.
The managing class is <link
+ xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html">CombinedBlockCache</link> by default.
+ The just-previous link describes the caching 'policy' implemented by CombinedBlockCache. In short, it works
+ by keeping meta blocks -- INDEX and BLOOM -- in the L1, onheap LruBlockCache tier, while DATA
+ blocks are kept in the L2, BucketCache tier. Since HBase version 1.0, it is possible to amend this
+ behavior and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier by
+ setting <varname>cacheDataInL1</varname> via
+ <code>HColumnDescriptor.setCacheDataInL1(true)</code>
+ or, in the shell, by creating or amending column families with <varname>CACHE_DATA_IN_L1</varname>
+ set to true: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}</programlisting></para>
+
+ <para>The BucketCache Block Cache can be deployed onheap, offheap, or file based.
+ You set which via the
+ <varname>hbase.bucketcache.ioengine</varname> setting. Setting it to
+ <varname>heap</varname> will have BucketCache deployed inside the
+ allocated Java heap. Setting it to <varname>offheap</varname> will have
+ BucketCache make its allocations offheap,
+ and an ioengine setting of <varname>file:PATH_TO_FILE</varname> will direct
+ BucketCache to use file caching (useful in particular if you have some fast I/O attached to the box, such
+ as SSDs).
+ </para>
+ <para xml:id="raw.l1.l2">It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache
+ policy and have BucketCache working as a strict L2 cache to the L1
+ LruBlockCache. For such a setup, set <varname>CacheConfig.BUCKET_CACHE_COMBINED_KEY</varname> to
+ <literal>false</literal>. In this mode, on eviction from L1, blocks go to L2.
+ When a block is cached, it is cached first in L1.
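+ In <filename>hbase-site.xml</filename>, this corresponds to setting the property behind
+ that key; I believe the backing property name is
+ <varname>hbase.bucketcache.combinedcache.enabled</varname>, but verify it against the
+ <classname>CacheConfig</classname> source for your version before relying on this sketch:
+ <programlisting language="xml"><![CDATA[<property>
+  <name>hbase.bucketcache.combinedcache.enabled</name>
+  <value>false</value>
+</property>]]></programlisting>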
When we go to look for a cached block,
+ we look first in L1 and, if none is found there, then search L2. Let us call this deploy format
+ <emphasis><indexterm><primary>Raw L1+L2</primary></indexterm></emphasis>.</para>
+ <para>Other BucketCache configs include specifying a location to persist the cache to across
+ restarts, how many threads to use writing the cache, etc. See the
+ <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html">CacheConfig.html</link>
+ class for configuration options and descriptions.</para>
+
+ <procedure>
+ <title>BucketCache Example Configuration</title>
+ <para>This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB
+ onheap cache. Configuration is performed on the RegionServer. Setting
+ <varname>hbase.bucketcache.ioengine</varname> and
+ <varname>hbase.bucketcache.size</varname> > 0 enables CombinedBlockCache.
+ Let us presume that the RegionServer has been set to run with a 5G heap:
+ i.e. HBASE_HEAPSIZE=5g.
+ </para>
+ <step>
+ <para>First, edit the RegionServer's <filename>hbase-env.sh</filename> and set
+ <varname>HBASE_OFFHEAPSIZE</varname> to a value greater than the offheap size wanted, in
+ this case, 4 GB (expressed as 4G). Let's set it to 5G. That'll be 4G
+ for our offheap cache and 1G for any other uses of offheap memory (there are
+ other users of offheap memory besides the BlockCache; e.g. the DFSClient
+ in the RegionServer can make use of offheap memory).
See <xref linkend="direct.memory" />.</para>
+ <programlisting>HBASE_OFFHEAPSIZE=5G</programlisting>
+ </step>
+ <step>
+ <para>Next, add the following configuration to the RegionServer's
+ <filename>hbase-site.xml</filename>.</para>
+ <programlisting language="xml">
+<![CDATA[<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>offheap</value>
+</property>
+<property>
+  <name>hfile.block.cache.size</name>
+  <value>0.2</value>
+</property>
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>4196</value>
+</property>]]>
+ </programlisting>
+ </step>
+ <step>
+ <para>Restart or rolling restart your cluster, and check the logs for any
+ issues.</para>
+ </step>
+ </procedure>
+ <para>In the above, we set the BucketCache to be 4G. We configured the onheap
+ LruBlockCache to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G).
+ In other words, you configure the L1 LruBlockCache as you would normally,
+ as you would when there is no L2 BucketCache present.
+ </para>
+ <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-10641"
+ >HBASE-10641</link> introduced the ability to configure multiple sizes for the
+ buckets of the bucketcache, in HBase 0.98 and newer. To configure multiple bucket
+ sizes, configure the new property <option>hfile.block.cache.sizes</option> (instead of
+ <option>hfile.block.cache.size</option>) to a comma-separated list of block sizes,
+ ordered from smallest to largest, with no spaces. The goal is to optimize the bucket
+ sizes based on your data access patterns. The following example configures buckets of
+ size 4096 and 8192.</para>
+ <screen language="xml"><![CDATA[
+<property>
+  <name>hfile.block.cache.sizes</name>
+  <value>4096,8192</value>
+</property>
+ ]]></screen>
+ <note xml:id="direct.memory">
+ <title>Direct Memory Usage In HBase</title>
+ <para>The default maximum direct memory varies by JVM.
Traditionally it is 64M
+ or some relation to allocated heap size (-Xmx), or no limit at all (apparently the case in JDK7).
+ HBase servers use direct memory; in particular, with short-circuit reading, the hosted DFSClient will
+ allocate direct memory buffers. If you do offheap block caching, you'll
+ be making use of direct memory. When starting your JVM, make sure
+ the <varname>-XX:MaxDirectMemorySize</varname> setting in
+ <filename>conf/hbase-env.sh</filename> is set to some value that is
+ higher than what you have allocated to your offheap blockcache
+ (<varname>hbase.bucketcache.size</varname>). It should be larger than your offheap block
+ cache and then some for DFSClient usage (how much the DFSClient uses is not
+ easy to quantify; it is the number of open hfiles * <varname>hbase.dfs.client.read.shortcircuit.buffer.size</varname>,
+ where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see <filename>hbase-default.xml</filename>
+ default configurations).
+ Direct memory is part of the Java process's memory, separate from the object
+ heap allocated by -Xmx. The value allocated by MaxDirectMemorySize must not exceed
+ physical RAM, and is likely to be less than the total available RAM due to other
+ memory requirements and system constraints.
+ </para>
+ <para>You can see how much memory -- onheap and offheap/direct -- a RegionServer is
+ configured to use and how much it is using at any one time by looking at the
+ <emphasis>Server Metrics: Memory</emphasis> tab in the UI. It is also available
+ via JMX. In particular, the direct memory currently used by the server can be found
+ on the <varname>java.nio.type=BufferPool,name=direct</varname> bean. Terracotta has
+ a <link
+ xlink:href="http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options"
+ >good write up</link> on using offheap memory in Java. It is for their product
+ BigMemory, but a lot of the issues noted apply in general to any attempt at going
+ offheap.
Check it out.</para>
+ </note>
+ <note xml:id="hbase.bucketcache.percentage.in.combinedcache"><title>hbase.bucketcache.percentage.in.combinedcache</title>
+ <para>This is a pre-HBase 1.0 configuration removed because it
+ was confusing. It was a float that you would set to some value
+ between 0.0 and 1.0. Its default was 0.9. If the deploy was using
+ CombinedBlockCache, then the LruBlockCache L1 size was calculated to
+ be (1 - <varname>hbase.bucketcache.percentage.in.combinedcache</varname>) * <varname>size-of-bucketcache</varname>
+ and the BucketCache size was <varname>hbase.bucketcache.percentage.in.combinedcache</varname> * size-of-bucket-cache,
+ where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size
+ IF it was specified as megabytes OR <varname>hbase.bucketcache.size</varname> * <varname>-XX:MaxDirectMemorySize</varname> if
+ <varname>hbase.bucketcache.size</varname> is between 0 and 1.0.
+ </para>
+ <para>In 1.0, it should be more straightforward. The L1 LruBlockCache size
+ is set as a fraction of the Java heap using the hfile.block.cache.size setting
+ (not the best name) and L2 is set as above, either in absolute
+ megabytes or as a fraction of allocated maximum direct memory.
+ </para>
+ </note>
+ </section>
+ </section>
+ <section>
+ <title>Compressed BlockCache</title>
+ <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-11331"
+ >HBASE-11331</link> introduced lazy blockcache decompression, more simply referred to
+ as compressed blockcache. When compressed blockcache is enabled,
data and encoded data
+ blocks are cached in the blockcache in their on-disk format, rather than being
+ decompressed and decrypted before caching.</para>
+ <para>For a RegionServer
+ hosting more data than can fit into cache, enabling this feature with SNAPPY compression
+ has been shown to result in a 50% increase in throughput and a 30% improvement in mean
+ latency, while increasing garbage collection by 80% and overall CPU load by
+ 2%. See HBASE-11331 for more details about how performance was measured and achieved.
+ For a RegionServer hosting data that can comfortably fit into cache, or if your workload
+ is sensitive to extra CPU or garbage-collection load, you may receive less
+ benefit.</para>
+ <para>Compressed blockcache is disabled by default. To enable it, set
+ <code>hbase.block.data.cachecompressed</code> to <code>true</code> in
+ <filename>hbase-site.xml</filename> on all RegionServers.</para>
+ </section>
+ </section>
+
+ <section
+ xml:id="wal">
+ <title>Write Ahead Log (WAL)</title>
+
+ <section
+ xml:id="purpose.wal">
+ <title>Purpose</title>
+ <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all changes to data in
+ HBase to file-based storage. Under normal operations, the WAL is not needed because
+ data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or
+ becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to
+ the data can be replayed. If writing to the WAL fails, the entire operation to modify the
+ data fails.</para>
+ <para>
+ HBase uses an implementation of the <link xlink:href=
+ "http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html"
+ >WAL</link> interface. Usually, there is only one instance of a WAL per RegionServer.
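+ A client can tune how an individual mutation interacts with the WAL (a hedged sketch
+ using the client-side <code>Durability</code> enum; <code>table</code> is an assumed,
+ already-opened table, and skipping the WAL sacrifices the replay guarantee described
+ above):
+ <programlisting language="java">Put p = new Put(Bytes.toBytes("someRow")); // hypothetical row key
+ p.setDurability(Durability.SKIP_WAL);      // this edit is NOT recoverable after a crash
+ table.put(p);</programlisting>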
The RegionServer records Puts and Deletes to it, before recording them to the <xref
+ linkend="store.memstore" /> for the affected <xref
+ linkend="store" />.
+ </para>
+ <note>
+ <title>The HLog</title>
+ <para>
+ Prior to 2.0, the interface for WALs in HBase was named <classname>HLog</classname>.
+ In 0.94, HLog was the name of the implementation of the WAL. You will likely find
+ references to the HLog in documentation tailored to these older versions.
+ </para>
+ </note>
+ <para>The WAL resides in HDFS in the <filename>/hbase/WALs/</filename> directory (prior to
+ HBase 0.94, they were stored in <filename>/hbase/.logs/</filename>), with subdirectories per
+ RegionServer.</para>
+ <para> For more general information about the concept of write ahead logs, see the
+ Wikipedia <link
+ xlink:href="http://en.wikipedia.org/wiki/Write-ahead_logging">Write-Ahead Log</link>
+ article. </para>
+ </section>
+ <section
+ xml:id="wal_flush">
+ <title>WAL Flushing</title>
+ <para>TODO (describe). </para>
+ </section>
+
+ <section
+ xml:id="wal_splitting">
+ <title>WAL Splitting</title>
+
+ <para>A RegionServer serves many regions. All of the regions in a region server share the
+ same active WAL file. Each edit in the WAL file includes information about which region
+ it belongs to. When a region is opened, the edits in the WAL file which belong to that
+ region need to be replayed. Therefore, edits in the WAL file must be grouped by region
+ so that particular sets can be replayed to regenerate the data in a particular region.
+ The process of grouping the WAL edits by region is called <firstterm>log
+ splitting</firstterm>. It is a critical process for recovering data if a region server
+ fails.</para>
+ <para>Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler
+ as a region server shuts down. So that consistency is guaranteed, affected regions
+ are unavailable until data is restored.
All WAL edits need to be recovered and replayed
+ before a given region can become available again. As a result, regions affected by
+ log splitting are unavailable until the process completes.</para>
+ <procedure xml:id="log.splitting.step.by.step">
+ <title>Log Splitting, Step by Step</title>
+ <step>
+ <title>The <filename>/hbase/WALs/<host>,<port>,<startcode></filename> directory is renamed.</title>
+ <para>Renaming the directory is important because a RegionServer may still be up and
+ accepting requests even if the HMaster thinks it is down. If the RegionServer does
+ not respond immediately and does not heartbeat its ZooKeeper session, the HMaster
+ may interpret this as a RegionServer failure. Renaming the logs directory ensures
+ that existing, valid WAL files which are still in use by an active but busy
+ RegionServer are not written to by accident.</para>
+ <para>The new directory is named according to the following pattern:</para>
+ <screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
+ <para>An example of such a renamed directory might look like the following:</para>
+ <screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
+ </step>
+ <step>
+ <title>Each log file is split, one at a time.</title>
+ <para>The log splitter reads the log file one edit entry at a time and puts each edit
+ entry into the buffer corresponding to the edit's region. At the same time, the
+ splitter starts several writer threads. Writer threads pick up a corresponding
+ buffer and write the edit entries in the buffer to a temporary recovered edit
+ file. The temporary edit file is stored to disk with the following naming pattern:</para>
+ <screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
+ <para>This file is used to store all the edits in the WAL log for this region.
After
+ log splitting completes, the <filename>.temp</filename> file is renamed to the
+ sequence ID of the first log written to the file.</para>
+ <para>To determine whether all edits have been written, the sequence ID is compared to
+ the sequence of the last edit that was written to the HFile. If the sequence of the
+ last edit is greater than or equal to the sequence ID included in the file name, it
+ is clear that all writes from the edit file have been completed.</para>
+ </step>
+ <step>
+ <title>After log splitting is complete, each affected region is assigned to a
+ RegionServer.</title>
+ <para> When the region is opened, the <filename>recovered.edits</filename> folder is checked for recovered
+ edits files. If any such files are present, they are replayed by reading the edits
+ and saving them to the MemStore. After all edit files are replayed, the contents of
+ the MemStore are written to disk (HFile) and the edit files are deleted.</para>
+ </step>
+ </procedure>
+
+ <section>
+ <title>Handling of Errors During Log Splitting</title>
+
+ <para>If you set the <varname>hbase.hlog.split.skip.errors</varname> option to
+ <constant>true</constant>, errors are treated as follows:</para>
+ <itemizedlist>
+ <listitem>
+ <para>Any error encountered during splitting will be logged.</para>
+ </listitem>
+ <listitem>
+ <para>The problematic WAL log will be moved into the <filename>.corrupt</filename>
+ directory under the hbase <varname>rootdir</varname>.</para>
+ </listitem>
+ <listitem>
+ <para>Processing of the WAL will continue.</para>
+ </listitem>
+ </itemizedlist>
+ <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
+ <literal>false</literal>, the default, the exception will be propagated and the
+ split will be logged as failed. See <link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-2958">HBASE-2958 When
+ hbase.hlog.split.skip.errors is set to false, we fail the split but thats
+ it</link>.
We need to do more than just fail the split if this flag is set.</para>
+
+ <section>
+ <title>How EOFExceptions are treated when splitting a crashed RegionServer's
+ WALs</title>
+
+ <para>If an EOFException occurs while splitting logs, the split proceeds even when
+ <varname>hbase.hlog.split.skip.errors</varname> is set to
+ <literal>false</literal>. An EOFException while reading the last log in the set of
+ files to split is likely, because the RegionServer is likely to be in the process of
+ writing a record at the time of a crash. For background, see <link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
+ Figure how to deal with eof splitting logs</link>.</para>
+ </section>
+ </section>
+
+ <section>
+ <title>Performance Improvements during Log Splitting</title>
+ <para>
+ WAL log splitting and recovery can be resource intensive and take a long time,
+ depending on the number of RegionServers involved in the crash and the size of the
+ regions. <xref linkend="distributed.log.splitting" /> and <xref
+ linkend="distributed.log.replay" /> were developed to improve
+ performance during log splitting.
+ </para>
+ <section xml:id="distributed.log.splitting">
+ <title>Distributed Log Splitting</title>
+ <para><firstterm>Distributed Log Splitting</firstterm> was added in HBase version 0.92
+ (<link xlink:href="https://issues.apache.org/jira/browse/HBASE-1364">HBASE-1364</link>)
+ by Prakash Khemani from Facebook. It reduces the time to complete log splitting
+ dramatically, improving the availability of regions and tables.
For
+ example, recovering a crashed cluster took around 9 hours with single-threaded log
+ splitting, but only about 6 minutes with distributed log splitting.</para>
+ <para>The information in this section is sourced from Jimmy Xiang's blog post at <link
+ xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/" />.</para>
+
+ <formalpara>
+ <title>Enabling or Disabling Distributed Log Splitting</title>
+ <para>Distributed log processing has been enabled by default since HBase 0.92. The setting
+ is controlled by the <property>hbase.master.distributed.log.splitting</property>
+ property, which can be set to <literal>true</literal> or <literal>false</literal>,
+ but defaults to <literal>true</literal>. </para>
+ </formalpara>
+ <procedure>
+ <title>Distributed Log Splitting, Step by Step</title>
+ <para>After configuring distributed log splitting, the HMaster controls the process.
+ The HMaster enrolls each RegionServer in the log splitting process, and the actual
+ work of splitting the logs is done by the RegionServers. The general process for
+ log splitting, as described in <xref
+ linkend="log.splitting.step.by.step" />, still applies here.</para>
+ <step>
+ <para>If distributed log processing is enabled, the HMaster creates a
+ <firstterm>split log manager</firstterm> instance when the cluster is started.
+ The split log manager manages all log files which need
+ to be scanned and split. The split log manager places all the logs into the
+ ZooKeeper splitlog node (<filename>/hbase/splitlog</filename>) as tasks. You can
+ view the contents of the splitlog by issuing the following
+ <command>zkcli</command> command.
Example output is shown.</para>
+ <screen language="bourne">ls /hbase/splitlog
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
+ </screen>
+ <para>The output contains some URL-encoded characters. When decoded, it looks much
+ simpler:</para>
+ <screen>
+[hdfs://host2.sample.com:56020/hbase/.logs
+/host8.sample.com,57020,1340474893275-splitting
+/host8.sample.com%3A57020.1340474893900,
+hdfs://host2.sample.com:56020/hbase/.logs
+/host3.sample.com,57020,1340474893299-splitting
+/host3.sample.com%3A57020.1340474893931,
+hdfs://host2.sample.com:56020/hbase/.logs
+/host4.sample.com,57020,1340474893287-splitting
+/host4.sample.com%3A57020.1340474893946]
+ </screen>
+ <para>The listing represents WAL file names to be scanned and split, which is a
+ list of log splitting tasks.</para>
+ </step>
+ <step>
+ <title>The split log manager monitors the log-splitting tasks and workers.</title>
+ <para>The split log manager is responsible for the following ongoing tasks:</para>
+ <itemizedlist>
+ <listitem>
+ <para>Once the split log manager publishes all the tasks to the splitlog
+ znode, it monitors these task nodes and waits for them to be
+ processed.</para>
+ </listitem>
+ <listitem>
+ <para>Checks to see if there are any dead split log
+ workers queued up. If it finds tasks claimed by unresponsive workers, it
+ will resubmit those tasks. If the resubmit fails due to some ZooKeeper
+ exception, the dead worker is queued up again for retry.</para>
+ </listitem>
+ <listitem>
+ <para>Checks to see if there are any unassigned
+ tasks.
If it finds any, it creates an ephemeral rescan node so that each + split log worker is notified to re-scan unassigned tasks via the + <code>nodeChildrenChanged</code> ZooKeeper event.</para> + </listitem> + <listitem> + <para>Checks for tasks which are assigned but expired. If any are found, they + are moved back to <code>TASK_UNASSIGNED</code> state so that they can + be retried. It is possible that these tasks are assigned to slow workers, or + they may already be finished. This is not a problem, because log splitting + tasks have the property of idempotence. In other words, the same log + splitting task can be processed many times without causing any + problem.</para> + </listitem> + <listitem> + <para>The split log manager watches the HBase split log znodes constantly. If + any split log task node data is changed, the split log manager retrieves the + node data. The + node data contains the current state of the task. You can use the + <command>zkcli</command> <command>get</command> command to retrieve the + current state of a task.
In the example output below, the first line of the + output shows that the task is currently unassigned.</para> + <screen> +<userinput>get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945 +</userinput> +<computeroutput>unassigned host2.sample.com:57000 +cZxid = 0x7115 +ctime = Sat Jun 23 11:13:40 PDT 2012 +...</computeroutput> + </screen> + <para>Based on the state of the task whose data is changed, the split log + manager does one of the following:</para> + + <itemizedlist> + <listitem> + <para>Resubmit the task if it is unassigned</para> + </listitem> + <listitem> + <para>Heartbeat the task if it is assigned</para> + </listitem> + <listitem> + <para>Resubmit or fail the task if it is resigned (see <xref + linkend="distributed.log.replay.failure.reasons" />)</para> + </listitem> + <listitem> + <para>Resubmit or fail the task if it is completed with errors (see <xref + linkend="distributed.log.replay.failure.reasons" />)</para> + </listitem> + <listitem> + <para>Resubmit or fail the task if it could not complete due to + errors (see <xref + linkend="distributed.log.replay.failure.reasons" />)</para> + </listitem> + <listitem> + <para>Delete the task if it is successfully completed or failed</para> + </listitem> + </itemizedlist> + <itemizedlist xml:id="distributed.log.replay.failure.reasons"> + <title>Reasons a Task Will Fail</title> + <listitem><para>The task has been deleted.</para></listitem> + <listitem><para>The node no longer exists.</para></listitem> + <listitem><para>The log status manager failed to move the state of the task + to TASK_UNASSIGNED.</para></listitem> + <listitem><para>The number of resubmits is over the resubmit + threshold.</para></listitem> + </itemizedlist> + </listitem> + </itemizedlist> + </step> + <step> + <title>Each RegionServer's split log worker performs the log-splitting tasks.</title> + <para>Each RegionServer runs
a daemon thread called the <firstterm>split log + worker</firstterm>, which does the work to split the logs. The daemon thread + starts when the RegionServer starts, and registers itself to watch HBase znodes. + If any splitlog znode children change, it notifies a sleeping worker thread to + wake up and grab more tasks. If a worker's current task's node data is + changed, the worker checks to see if the task has been taken by another worker. + If so, the worker thread stops work on the current task.</para> + <para>The worker monitors + the splitlog znode constantly. When a new task appears, the split log worker + retrieves the task paths and checks each one until it finds an unclaimed task, + which it attempts to claim. If the claim was successful, it attempts to perform + the task and updates the task's <property>state</property> property based on the + splitting outcome. At this point, the split log worker scans for another + unclaimed task.</para> + <itemizedlist> + <title>How the Split Log Worker Approaches a Task</title> + + <listitem> + <para>It queries the task state and only takes action if the task is in + <literal>TASK_UNASSIGNED</literal> state.</para> + </listitem> + <listitem> + <para>If the task is in <literal>TASK_UNASSIGNED</literal> state, the + worker attempts to set the state to <literal>TASK_OWNED</literal> by itself. + If it fails to set the state, another worker will try to grab it. The split + log manager will also ask all workers to rescan later if the task remains + unassigned.</para> + </listitem> + <listitem> + <para>If the worker succeeds in taking ownership of the task, it tries to get + the task state again to make sure it really got it, since the claim is asynchronous.
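The claim described above relies on ZooKeeper's version-checked update: a worker's write succeeds only if no other worker changed the znode first. The sketch below is purely illustrative; the `ZNode` class is a hypothetical in-memory stand-in for ZooKeeper, not HBase's actual implementation.

```python
# Illustrative sketch of the version-checked claim a split log worker makes.
# ZNode is a hypothetical in-memory stand-in for a ZooKeeper node; real code
# would use ZooKeeper's conditional setData(path, data, version).

class ZNode:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        """Update the node only if its version is unchanged."""
        if self.version != expected_version:
            return False  # another worker updated the node first
        self.data = data
        self.version += 1
        return True

def try_claim(task, worker_name):
    """Claim a task only if it was observed in TASK_UNASSIGNED state."""
    state, version = task.data, task.version
    if not state.startswith("TASK_UNASSIGNED"):
        return False
    return task.conditional_set("TASK_OWNED " + worker_name, version)

task = ZNode("TASK_UNASSIGNED host2.sample.com:57000")
won_a = try_claim(task, "worker-a")  # first worker wins the race
won_b = try_claim(task, "worker-b")  # task is no longer unassigned
```

Because the version bumps on every successful write, two workers racing on the same task can never both believe they own it, which is what lets the manager safely resubmit tasks later.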
In the + meantime, it starts a split task executor to do the actual work: </para> + <itemizedlist> + <listitem> + <para>Get the HBase root folder, create a temp folder under the root, and + split the log file to the temp folder.</para> + </listitem> + <listitem> + <para>If the split was successful, the task executor sets the task to + state <literal>TASK_DONE</literal>.</para> + </listitem> + <listitem> + <para>If the worker catches an unexpected IOException, the task is set to + state <literal>TASK_ERR</literal>.</para> + </listitem> + <listitem> + <para>If the worker is shutting down, the task is set to state + <literal>TASK_RESIGNED</literal>.</para> + </listitem> + <listitem> + <para>If the task is taken by another worker, just log it.</para> + </listitem> + </itemizedlist> + </listitem> + </itemizedlist> + </step> + <step> + <title>The split log manager monitors for uncompleted tasks.</title> + <para>The split log manager returns when all tasks are completed successfully. If + some tasks completed with failures, the split log manager throws an + exception so that the log splitting can be retried. Due to the asynchronous + implementation, in very rare cases, the split log manager loses track of some + completed tasks. For that reason, it periodically checks for remaining + uncompleted tasks in its task map or ZooKeeper. If none are found, it throws an + exception so that the log splitting can be retried right away instead of hanging + there waiting for something that won't happen.</para> + </step> + </procedure> + </section> + <section xml:id="distributed.log.replay"> + <title>Distributed Log Replay</title> + <para>After a RegionServer fails, its failed region is assigned to another + RegionServer, which is marked as "recovering" in ZooKeeper. A split log worker directly + replays edits from the WAL of the failed RegionServer to the region at its new + location.
When a region is in "recovering" state, it can accept writes, but it cannot accept reads + (including Append and Increment), region splits, or merges. </para> + <para>Distributed Log Replay extends the <xref linkend="distributed.log.splitting" /> framework. It works by + directly replaying WAL edits to another RegionServer instead of creating + <filename>recovered.edits</filename> files. It provides the following advantages + over distributed log splitting alone:</para> + <itemizedlist> + <listitem><para>It eliminates the overhead of writing and reading a large number of + <filename>recovered.edits</filename> files. It is not unusual for thousands of + <filename>recovered.edits</filename> files to be created and written concurrently + during a RegionServer recovery. Many small random writes can degrade overall + system performance.</para></listitem> + <listitem><para>It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again. +</para></listitem> + </itemizedlist> + <formalpara> + <title>Enabling Distributed Log Replay</title> + <para>To enable distributed log replay, set <varname>hbase.master.distributed.log.replay</varname> to + <literal>true</literal>. This will be the default for HBase 0.99 (<link + xlink:href="https://issues.apache.org/jira/browse/HBASE-10888">HBASE-10888</link>).</para> + </formalpara> + <para>You must also enable HFile version 3 (which is the default HFile format starting + in HBase 0.99). See <link + xlink:href="https://issues.apache.org/jira/browse/HBASE-10855">HBASE-1085
<TRUNCATED>
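The URL-encoded task names in the <filename>/hbase/splitlog</filename> listing shown earlier are ordinary percent-encoding of the WAL paths, so a single decoding pass recovers them. An illustrative snippet (any standard URL-decoding routine works; Python's urllib is used here):

```python
from urllib.parse import unquote

# One of the encoded task names from the /hbase/splitlog listing shown earlier.
encoded = ("hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2F"
           "host8.sample.com%2C57020%2C1340474893275-splitting%2F"
           "host8.sample.com%253A57020.1340474893900")

# One decoding pass recovers the WAL path. The %3A that remains inside the
# final file name was double-encoded (%253A) in the znode name, so it
# survives a single pass, matching the decoded listing in the text.
decoded = unquote(encoded)
print(decoded)
# hdfs://host2.sample.com:56020/hbase/.logs/host8.sample.com,57020,1340474893275-splitting/host8.sample.com%3A57020.1340474893900
```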

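The split log manager's reaction to a change in a task's state, together with the resubmit threshold listed under the failure reasons, can be summarized in a small dispatch sketch. All names here (the state labels, the threshold constant, the function itself) are hypothetical shorthand for the behavior described above, not HBase's actual code:

```python
# Illustrative summary of the split log manager's per-state handling of a
# task whose znode data changed. State labels and the threshold value are
# assumptions for illustration; HBase's real values are configurable.

RESUBMIT_THRESHOLD = 3  # assumed limit on retries of a failing task

def handle_task_state(state, resubmits):
    """Return the manager's action for a task state change."""
    if state == "unassigned":
        return "resubmit"
    if state == "owned":
        return "heartbeat"
    if state in ("resigned", "completed_with_errors"):
        # Retry until the resubmit threshold is reached, then fail the task.
        return "resubmit" if resubmits < RESUBMIT_THRESHOLD else "fail"
    if state in ("done", "failed"):
        return "delete"
    raise ValueError("unknown task state: " + state)
```

Because tasks are idempotent, resubmitting a task that a slow worker already finished is harmless, which is why the dispatch can err on the side of retrying.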