http://git-wip-us.apache.org/repos/asf/hbase/blob/e80b3092/src/main/docbkx/architecture.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/architecture.xml b/src/main/docbkx/architecture.xml deleted file mode 100644 index 16b298a..0000000 --- a/src/main/docbkx/architecture.xml +++ /dev/null @@ -1,3489 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<chapter - xml:id="architecture" - version="5.0" - xmlns="http://docbook.org/ns/docbook" - xmlns:xlink="http://www.w3.org/1999/xlink" - xmlns:xi="http://www.w3.org/2001/XInclude" - xmlns:svg="http://www.w3.org/2000/svg" - xmlns:m="http://www.w3.org/1998/Math/MathML" - xmlns:html="http://www.w3.org/1999/xhtml" - xmlns:db="http://docbook.org/ns/docbook"> - <!--/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ ---> - - <title>Architecture</title> - <section xml:id="arch.overview"> - <title>Overview</title> - <section xml:id="arch.overview.nosql"> - <title>NoSQL?</title> - <para>HBase is a type of "NoSQL" database. 
"NoSQL" is a general term meaning that the database isn't an RDBMS which
- supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an
- example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking,
- HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS,
- such as typed columns, secondary indexes, triggers, and advanced query languages.
- </para>
- <para>However, HBase has many features which support both linear and modular scaling. HBase clusters expand
- by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20
- RegionServers, for example, it doubles in both storage and processing capacity.
- An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best
- performance it requires specialized hardware and storage devices. HBase features of note are:
- <itemizedlist>
- <listitem><para>Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore.
This
- makes it very suitable for tasks such as high-speed counter aggregation.</para> </listitem>
- <listitem><para>Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are
- automatically split and re-distributed as your data grows.</para></listitem>
- <listitem><para>Automatic RegionServer failover</para></listitem>
- <listitem><para>Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.</para></listitem>
- <listitem><para>MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both
- source and sink.</para></listitem>
- <listitem><para>Java Client API: HBase supports an easy-to-use Java API for programmatic access.</para></listitem>
- <listitem><para>Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.</para></listitem>
- <listitem><para>Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume query optimization.</para></listitem>
- <listitem><para>Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.</para></listitem>
- </itemizedlist>
- </para>
- </section>
-
- <section xml:id="arch.overview.when">
- <title>When Should I Use HBase?</title>
- <para>HBase isn't suitable for every problem.</para>
- <para>First, make sure you have enough data. If you have hundreds of millions or billions of rows, then
- HBase is a good candidate. If you only have a few thousand or a few million rows, then using a traditional RDBMS
- might be a better choice, because all of your data might wind up on a single node (or two) and
- the rest of the cluster may be sitting idle.
- </para>
- <para>Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns,
- secondary indexes, transactions, advanced query languages, etc.)
An application built against an RDBMS cannot be
- "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a
- complete redesign as opposed to a port.
- </para>
- <para>Third, make sure you have enough hardware. Even HDFS doesn't do well with fewer than
- 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
- </para>
- <para>HBase can run quite well stand-alone on a laptop - but this should be considered a development
- configuration only.
- </para>
- </section>
- <section xml:id="arch.overview.hbasehdfs">
- <title>What Is The Difference Between HBase and Hadoop/HDFS?</title>
- <para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files.
- Its documentation states that it is not, however, a general-purpose file system, and does not provide fast individual record lookups in files.
- HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
- This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist
- on HDFS for high-speed lookups. See the <xref linkend="datamodel" /> and the rest of this chapter for more information on how HBase achieves its goals.
- </para>
- </section>
- </section>
-
- <section
- xml:id="arch.catalog">
- <title>Catalog Tables</title>
- <para>The catalog table <code>hbase:meta</code> exists as an HBase table and is filtered out of the HBase
- shell's <code>list</code> command, but is in fact a table just like any other. </para>
- <section
- xml:id="arch.catalog.root">
- <title>-ROOT-</title>
- <note>
- <para>The <code>-ROOT-</code> table was removed in HBase 0.96.0.
Information here should
- be considered historical.</para>
- </note>
- <para>The <code>-ROOT-</code> table kept track of the location of the
- <code>.META.</code> table (the previous name for the table now called <code>hbase:meta</code>) prior to HBase
- 0.96. The <code>-ROOT-</code> table structure was as follows: </para>
- <itemizedlist>
- <title>Key</title>
- <listitem>
- <para>.META. region key (<code>.META.,,1</code>)</para>
- </listitem>
- </itemizedlist>
-
- <itemizedlist>
- <title>Values</title>
- <listitem>
- <para><code>info:regioninfo</code> (serialized <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link>
- instance of hbase:meta)</para>
- </listitem>
- <listitem>
- <para><code>info:server</code> (server:port of the RegionServer holding
- hbase:meta)</para>
- </listitem>
- <listitem>
- <para><code>info:serverstartcode</code> (start-time of the RegionServer process holding
- hbase:meta)</para>
- </listitem>
- </itemizedlist>
- </section>
- <section
- xml:id="arch.catalog.meta">
- <title>hbase:meta</title>
- <para>The <code>hbase:meta</code> table (previously called <code>.META.</code>) keeps a list
- of all regions in the system.
The location of <code>hbase:meta</code> was previously
- tracked within the <code>-ROOT-</code> table, but is now stored in ZooKeeper.</para>
- <para>The <code>hbase:meta</code> table structure is as follows: </para>
- <itemizedlist>
- <title>Key</title>
- <listitem>
- <para>Region key of the format (<code>[table],[region start key],[region
- id]</code>)</para>
- </listitem>
- </itemizedlist>
- <itemizedlist>
- <title>Values</title>
- <listitem>
- <para><code>info:regioninfo</code> (serialized <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">
- HRegionInfo</link> instance for this region)</para>
- </listitem>
- <listitem>
- <para><code>info:server</code> (server:port of the RegionServer containing this
- region)</para>
- </listitem>
- <listitem>
- <para><code>info:serverstartcode</code> (start-time of the RegionServer process
- containing this region)</para>
- </listitem>
- </itemizedlist>
- <para>When a table is in the process of splitting, two other columns will be created, called
- <code>info:splitA</code> and <code>info:splitB</code>. These columns represent the two
- daughter regions. The values for these columns are also serialized HRegionInfo instances.
- After the region has been split, this row will eventually be deleted. </para>
- <note>
- <title>Note on HRegionInfo</title>
- <para>The empty key is used to denote table start and table end. A region with an empty
- start key is the first region in a table. If a region has both an empty start and an
- empty end key, it is the only region in the table. </para>
- </note>
- <para>In the (hopefully unlikely) event that programmatic processing of catalog metadata is
- required, see the <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</link>
- utility.
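The region key format above, <code>[table],[region start key],[region id]</code>, can be illustrated with plain strings. The sketch below is illustrative only: it is not the HBase API, the helper names are invented, and real meta row keys are byte arrays rather than Java Strings (a real start key can contain arbitrary bytes, including commas, which this naive split would mishandle).

```java
public class MetaKeyExample {

    // Compose a hbase:meta row key: [table],[region start key],[region id].
    static String metaRowKey(String table, String startKey, String regionId) {
        return table + "," + startKey + "," + regionId;
    }

    // Recover the three parts. Splitting on the first and last comma
    // tolerates an empty start key (the first region of a table).
    static String[] parseMetaRowKey(String key) {
        int first = key.indexOf(',');
        int last = key.lastIndexOf(',');
        return new String[] {
            key.substring(0, first),        // table name
            key.substring(first + 1, last), // region start key (may be empty)
            key.substring(last + 1)         // region id
        };
    }

    public static void main(String[] args) {
        String key = metaRowKey("TestTable", "row-0500", "1342342342342");
        System.out.println(key); // TestTable,row-0500,1342342342342

        // The first region of a table has an empty start key.
        String[] parts = parseMetaRowKey(metaRowKey("TestTable", "", "1342342342342"));
        System.out.println(parts[0] + " | <" + parts[1] + "> | " + parts[2]);
    }
}
```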
</para>
- </section>
- <section
- xml:id="arch.catalog.startup">
- <title>Startup Sequencing</title>
- <para>First, the location of <code>hbase:meta</code> is looked up in ZooKeeper. Next,
- <code>hbase:meta</code> is updated with server and startcode values.</para>
- <para>For information on region-RegionServer assignment, see <xref
- linkend="regions.arch.assignment" />. </para>
- </section>
- </section> <!-- catalog -->
-
- <section
- xml:id="client">
- <title>Client</title>
- <para>The HBase client finds the RegionServers that are serving the particular row range of
- interest. It does this by querying the <code>hbase:meta</code> table. See <xref
- linkend="arch.catalog.meta" /> for details. After locating the required region(s), the
- client contacts the RegionServer serving that region, rather than going through the master,
- and issues the read or write request. This information is cached in the client so that
- subsequent requests need not go through the lookup process. Should a region be reassigned
- either by the master load balancer or because a RegionServer has died, the client will
- requery the catalog tables to determine the new location of the user region. </para>
-
- <para>See <xref
- linkend="master.runtime" /> for more information about the impact of the Master on HBase
- Client communication. </para>
- <para>Administrative functions are done via an instance of <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html">Admin</link>.
- </para>
-
- <section
- xml:id="client.connections">
- <title>Cluster Connections</title>
- <para>The API changed in HBase 1.0. It has been cleaned up, and users now work against
- interfaces rather than particular types. In HBase 1.0,
- obtain a cluster Connection from ConnectionFactory and thereafter, get from it
- instances of Table, Admin, and RegionLocator on an as-needed basis. When done, close
- obtained instances.
Finally, be sure to clean up your Connection instance before
- exiting. Connections are heavyweight objects. Create once and keep an instance around.
- Table, Admin and RegionLocator instances are lightweight. Create as you go and then
- let go as soon as you are done by closing them. See the
- <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html">Client Package Javadoc Description</link> for example usage of the new HBase 1.0 API.</para>
-
- <para>For connection configuration information, see <xref linkend="client_dependencies" />. </para>
-
- <para><emphasis><link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
- instances are not thread-safe</emphasis>. Only one thread can use an instance of Table at
- any given time. When creating Table instances, it is advisable to use the same <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link>
- instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers,
- which is usually what you want. For example, this is preferred:</para>
- <programlisting language="java">Configuration conf = HBaseConfiguration.create();
-HTable table1 = new HTable(conf, "myTable");
-HTable table2 = new HTable(conf, "myTable");</programlisting>
- <para>as opposed to this:</para>
- <programlisting language="java">Configuration conf1 = HBaseConfiguration.create();
-HTable table1 = new HTable(conf1, "myTable");
-Configuration conf2 = HBaseConfiguration.create();
-HTable table2 = new HTable(conf2, "myTable");</programlisting>
-
- <para>For more information about how connections are handled in the HBase client,
- see <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html">HConnectionManager</link>.
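Under the HBase 1.0 API just described, the recommended pattern looks roughly like the sketch below. It assumes the HBase 1.0 client library on the classpath and a reachable cluster, and the table name is made up, so treat it as illustrative rather than definitive:

```java
Configuration conf = HBaseConfiguration.create();
// Heavyweight: create the Connection once and share it across threads.
try (Connection connection = ConnectionFactory.createConnection(conf)) {
  // Lightweight: obtain per task, and close as soon as you are done.
  try (Table table = connection.getTable(TableName.valueOf("myTable"));
       Admin admin = connection.getAdmin()) {
    // ... use table and admin ...
  }
}
```

Connection, Table, and Admin all implement Closeable in HBase 1.0, so try-with-resources handles the close calls described above.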
- </para>
- <section xml:id="client.connection.pooling"><title>Connection Pooling</title>
- <para>For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads
- in a single JVM), you can pre-create an <classname>HConnection</classname>, as shown in
- the following example:</para>
- <example>
- <title>Pre-Creating an <code>HConnection</code></title>
- <programlisting language="java">// Create a connection to the cluster.
-Configuration conf = HBaseConfiguration.create();
-HConnection connection = HConnectionManager.createConnection(conf);
-HTableInterface table = connection.getTable("myTable");
-// use table as needed, the table returned is lightweight
-table.close();
-// use the connection for other access to the cluster
-connection.close();</programlisting>
- </example>
- <para>Constructing an HTableInterface implementation is very lightweight, and resources are
- controlled.</para>
- <warning>
- <title><code>HTablePool</code> is Deprecated</title>
- <para>Previous versions of this guide discussed <code>HTablePool</code>, which was
- deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-6580">HBASE-6580</link>.
- Please use <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html"><code>HConnection</code></link> instead.</para>
- </warning>
- </section>
- </section>
- <section xml:id="client.writebuffer"><title>WriteBuffer and Batch Methods</title>
- <para>If <xref linkend="perf.hbase.client.autoflush" /> is turned off on
- <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
- <classname>Put</classname>s are sent to RegionServers when the writebuffer
- is filled. The writebuffer is 2MB by default.
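The buffering behavior described here can be sketched as follows (pre-1.0 HTable API; this assumes a running cluster and an existing list of Puts, so it is illustrative only):

```java
HTable table = new HTable(conf, "myTable");
table.setAutoFlush(false);                 // buffer Puts on the client side
table.setWriteBufferSize(4 * 1024 * 1024); // raise the 2MB default to 4MB
for (Put put : puts) {
  table.put(put);      // held in the writebuffer until it fills
}
table.flushCommits();  // push any remaining buffered Puts
table.close();         // close() also flushes the writebuffer
```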
Before an HTable instance is - discarded, either <methodname>close()</methodname> or - <methodname>flushCommits()</methodname> should be invoked so Puts - will not be lost. - </para> - <para>Note: <code>htable.delete(Delete);</code> does not go in the writebuffer! This only applies to Puts. - </para> - <para>For additional information on write durability, review the <link xlink:href="../acid-semantics.html">ACID semantics</link> page. - </para> - <para>For fine-grained control of batching of - <classname>Put</classname>s or <classname>Delete</classname>s, - see the <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29">batch</link> methods on HTable. - </para> - </section> - <section xml:id="client.external"><title>External Clients</title> - <para>Information on non-Java clients and custom protocols is covered in <xref linkend="external_apis" /> - </para> - </section> - </section> - - <section xml:id="client.filter"><title>Client Request Filters</title> - <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link> and <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> instances can be - optionally configured with <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html">filters</link> which are applied on the RegionServer. - </para> - <para>Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups - of Filter functionality. 
- </para>
- <section xml:id="client.filter.structural"><title>Structural</title>
- <para>Structural Filters contain other Filters.</para>
- <section xml:id="client.filter.structural.fl"><title>FilterList</title>
- <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html">FilterList</link>
- represents a list of Filters with a relationship of <code>FilterList.Operator.MUST_PASS_ALL</code> or
- <code>FilterList.Operator.MUST_PASS_ONE</code> between the Filters. The following example shows an 'or' between two
- Filters (checking for either 'my value' or 'my other value' on the same attribute).</para>
-<programlisting language="java">
-FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
-SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
-  cf,
-  column,
-  CompareOp.EQUAL,
-  Bytes.toBytes("my value")
-  );
-list.addFilter(filter1);
-SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
-  cf,
-  column,
-  CompareOp.EQUAL,
-  Bytes.toBytes("my other value")
-  );
-list.addFilter(filter2);
-scan.setFilter(list);
-</programlisting>
- </section>
- </section>
- <section
- xml:id="client.filter.cv">
- <title>Column Value</title>
- <section
- xml:id="client.filter.cv.scvf">
- <title>SingleColumnValueFilter</title>
- <para><link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html">SingleColumnValueFilter</link>
- can be used to test column values for equivalence (<code><link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html">CompareOp.EQUAL</link>
- </code>), inequality (<code>CompareOp.NOT_EQUAL</code>), or ranges (e.g.,
- <code>CompareOp.GREATER</code>).
The following is an example of testing equivalence of a
- column to the String value "my value":</para>
- <programlisting language="java">
-SingleColumnValueFilter filter = new SingleColumnValueFilter(
-  cf,
-  column,
-  CompareOp.EQUAL,
-  Bytes.toBytes("my value")
-  );
-scan.setFilter(filter);
-</programlisting>
- </section>
- </section>
- <section
- xml:id="client.filter.cvp">
- <title>Column Value Comparators</title>
- <para>There are several Comparator classes in the Filter package that deserve special
- mention. These Comparators are used in concert with other Filters, such as <xref
- linkend="client.filter.cv.scvf" />. </para>
- <section
- xml:id="client.filter.cvp.rcs">
- <title>RegexStringComparator</title>
- <para><link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html">RegexStringComparator</link>
- supports regular expressions for value comparisons.</para>
- <programlisting language="java">
-RegexStringComparator comp = new RegexStringComparator("my.");   // any value that starts with 'my'
-SingleColumnValueFilter filter = new SingleColumnValueFilter(
-  cf,
-  column,
-  CompareOp.EQUAL,
-  comp
-  );
-scan.setFilter(filter);
-</programlisting>
- <para>See the Oracle JavaDoc for <link
- xlink:href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">supported
- RegEx patterns in Java</link>.
</para>
- <programlisting language="java">
-SubstringComparator comp = new SubstringComparator("y val");   // looking for 'my value'
-SingleColumnValueFilter filter = new SingleColumnValueFilter(
-  cf,
-  column,
-  CompareOp.EQUAL,
-  comp
-  );
-scan.setFilter(filter);
-</programlisting>
- </section>
- <section
- xml:id="client.filter.cvp.bfp">
- <title>BinaryPrefixComparator</title>
- <para>See <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.html">BinaryPrefixComparator</link>.</para>
- </section>
- <section
- xml:id="client.filter.cvp.bc">
- <title>BinaryComparator</title>
- <para>See <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComparator.html">BinaryComparator</link>.</para>
- </section>
- </section>
- <section
- xml:id="client.filter.kvm">
- <title>KeyValue Metadata</title>
- <para>As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate
- the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to
- values in the previous section. </para>
- <section
- xml:id="client.filter.kvm.ff">
- <title>FamilyFilter</title>
- <para><link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html">FamilyFilter</link>
- can be used to filter on the ColumnFamily. It is generally a better idea to select
- ColumnFamilies in the Scan than to do it with a Filter.</para>
- </section>
- <section
- xml:id="client.filter.kvm.qf">
- <title>QualifierFilter</title>
- <para><link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html">QualifierFilter</link>
- can be used to filter based on Column (aka Qualifier) name.
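For example, the following sketch (in the style of the earlier filter examples; the qualifier name is invented) keeps only cells whose qualifier equals "myQualifier":

```java
QualifierFilter filter = new QualifierFilter(
  CompareOp.EQUAL,
  new BinaryComparator(Bytes.toBytes("myQualifier")));
scan.setFilter(filter);
```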
</para> - </section> - <section - xml:id="client.filter.kvm.cpf"> - <title>ColumnPrefixFilter</title> - <para><link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html">ColumnPrefixFilter</link> - can be used to filter based on the lead portion of Column (aka Qualifier) names. </para> - <para>A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row - and for each involved column family. It can be used to efficiently get a subset of the - columns in very wide rows. </para> - <para>Note: The same column qualifier can be used in different column families. This - filter returns all matching columns. </para> - <para>Example: Find all columns in a row and family that start with "abc"</para> - <programlisting language="java"> -HTableInterface t = ...; -byte[] row = ...; -byte[] family = ...; -byte[] prefix = Bytes.toBytes("abc"); -Scan scan = new Scan(row, row); // (optional) limit to one row -scan.addFamily(family); // (optional) limit to one family -Filter f = new ColumnPrefixFilter(prefix); -scan.setFilter(f); -scan.setBatch(10); // set this if there could be many columns returned -ResultScanner rs = t.getScanner(scan); -for (Result r = rs.next(); r != null; r = rs.next()) { - for (KeyValue kv : r.raw()) { - // each kv represents a column - } -} -rs.close(); -</programlisting> - </section> - <section - xml:id="client.filter.kvm.mcpf"> - <title>MultipleColumnPrefixFilter</title> - <para><link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html">MultipleColumnPrefixFilter</link> - behaves like ColumnPrefixFilter but allows specifying multiple prefixes. </para> - <para>Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the - first column matching the lowest prefix and also seeks past ranges of columns between - prefixes. It can be used to efficiently get discontinuous sets of columns from very wide - rows. 
</para>
- <para>Example: Find all columns in a row and family that start with "abc" or "xyz"</para>
- <programlisting language="java">
-HTableInterface t = ...;
-byte[] row = ...;
-byte[] family = ...;
-byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
-Scan scan = new Scan(row, row); // (optional) limit to one row
-scan.addFamily(family); // (optional) limit to one family
-Filter f = new MultipleColumnPrefixFilter(prefixes);
-scan.setFilter(f);
-scan.setBatch(10); // set this if there could be many columns returned
-ResultScanner rs = t.getScanner(scan);
-for (Result r = rs.next(); r != null; r = rs.next()) {
-  for (KeyValue kv : r.raw()) {
-    // each kv represents a column
-  }
-}
-rs.close();
-</programlisting>
- </section>
- <section
- xml:id="client.filter.kvm.crf">
- <title>ColumnRangeFilter</title>
- <para>A <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html">ColumnRangeFilter</link>
- allows efficient intra-row scanning. </para>
- <para>A ColumnRangeFilter can seek ahead to the first matching column for each involved
- column family. It can be used to efficiently get a 'slice' of the columns of a very wide
- row, e.g., when you have a million columns in a row but you only want to look at columns
- bbbb-bbdd. </para>
- <para>Note: The same column qualifier can be used in different column families. This
- filter returns all matching columns.
</para>
- <para>Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd"
- (inclusive)</para>
- <programlisting language="java">
-HTableInterface t = ...;
-byte[] row = ...;
-byte[] family = ...;
-byte[] startColumn = Bytes.toBytes("bbbb");
-byte[] endColumn = Bytes.toBytes("bbdd");
-Scan scan = new Scan(row, row); // (optional) limit to one row
-scan.addFamily(family); // (optional) limit to one family
-Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
-scan.setFilter(f);
-scan.setBatch(10); // set this if there could be many columns returned
-ResultScanner rs = t.getScanner(scan);
-for (Result r = rs.next(); r != null; r = rs.next()) {
-  for (KeyValue kv : r.raw()) {
-    // each kv represents a column
-  }
-}
-rs.close();
-</programlisting>
- <para>Note: Introduced in HBase 0.92.</para>
- </section>
- </section>
- <section xml:id="client.filter.row"><title>RowKey</title>
- <section xml:id="client.filter.row.rf"><title>RowFilter</title>
- <para>It is generally a better idea to use the startRow/stopRow methods on Scan for row selection; however,
- <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RowFilter.html">RowFilter</link> can also be used.</para>
- </section>
- </section>
- <section xml:id="client.filter.utility"><title>Utility</title>
- <section xml:id="client.filter.utility.fkof"><title>FirstKeyOnlyFilter</title>
- <para>This is primarily used for rowcount jobs.
- See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html">FirstKeyOnlyFilter</link>.</para>
- </section>
- </section>
- </section> <!-- client.filter -->
-
- <section xml:id="master"><title>Master</title>
- <para><code>HMaster</code> is the implementation of the Master server. The Master server is
- responsible for monitoring all RegionServer instances in the cluster, and is the interface
- for all metadata changes.
In a distributed cluster, the Master typically runs on the <xref
- linkend="arch.hdfs.nn"/>. J Mohamed Zahoor goes into some more detail on the Master
- Architecture in this blog posting, <link
- xlink:href="http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/">HBase HMaster
- Architecture</link>.</para>
- <section xml:id="master.startup"><title>Startup Behavior</title>
- <para>If run in a multi-Master environment, all Masters compete to run the cluster. If the active
- Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to
- take over the Master role.
- </para>
- </section>
- <section
- xml:id="master.runtime">
- <title>Runtime Impact</title>
- <para>A common dist-list question involves what happens to an HBase cluster when the Master
- goes down. Because the HBase client talks directly to the RegionServers, the cluster can
- still function in a "steady state." Additionally, per <xref
- linkend="arch.catalog" />, <code>hbase:meta</code> exists as an HBase table and is not
- resident in the Master. However, the Master controls critical functions such as
- RegionServer failover and completing region splits. So while the cluster can still run for
- a short time without the Master, the Master should be restarted as soon as possible.
- </para>
- </section>
- <section xml:id="master.api"><title>Interface</title>
- <para>The methods exposed by <code>HMasterInterface</code> are primarily metadata-oriented methods:
- <itemizedlist>
- <listitem><para>Table (createTable, modifyTable, removeTable, enable, disable)
- </para></listitem>
- <listitem><para>ColumnFamily (addColumn, modifyColumn, removeColumn)
- </para></listitem>
- <listitem><para>Region (move, assign, unassign)
- </para></listitem>
- </itemizedlist>
- For example, when the <code>HBaseAdmin</code> method <code>disableTable</code> is invoked, it is serviced by the Master server.
- </para>
- </section>
- <section xml:id="master.processes"><title>Processes</title>
- <para>The Master runs several background threads:
- </para>
- <section xml:id="master.processes.loadbalancer"><title>LoadBalancer</title>
- <para>Periodically, and when there are no regions in transition,
- a load balancer will run and move regions around to balance the cluster's load.
- See <xref linkend="balancer_config" /> for configuring this property.</para>
- <para>See <xref linkend="regions.arch.assignment"/> for more information on region assignment.
- </para>
- </section>
- <section xml:id="master.processes.catalog"><title>CatalogJanitor</title>
- <para>Periodically checks and cleans up the hbase:meta table. See <xref linkend="arch.catalog.meta" /> for more information on META.</para>
- </section>
- </section>
-
- </section>
- <section
- xml:id="regionserver.arch">
- <title>RegionServer</title>
- <para><code>HRegionServer</code> is the RegionServer implementation. It is responsible for
- serving and managing regions. In a distributed cluster, a RegionServer runs on a <xref
- linkend="arch.hdfs.dn" />. </para>
- <section
- xml:id="regionserver.arch.api">
- <title>Interface</title>
- <para>The methods exposed by <code>HRegionInterface</code> contain both data-oriented
- and region-maintenance methods: <itemizedlist>
- <listitem>
- <para>Data (get, put, delete, next, etc.)</para>
- </listitem>
- <listitem>
- <para>Region (splitRegion, compactRegion, etc.)</para>
- </listitem>
- </itemizedlist> For example, when the <code>HBaseAdmin</code> method
- <code>majorCompact</code> is invoked on a table, the client is actually iterating
- through all regions for the specified table and sending a major compaction request directly to
- each region.
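That per-region fan-out is hidden behind a single client call, roughly as in this sketch (0.9x-era HBaseAdmin API; assumes a running cluster, and the table name is invented):

```java
HBaseAdmin admin = new HBaseAdmin(conf);
// One client-side call; under the covers the client iterates the table's
// regions and requests a major compaction of each one.
admin.majorCompact("myTable");
admin.close();
```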
</para>
- </section>
- <section
- xml:id="regionserver.arch.processes">
- <title>Processes</title>
- <para>The RegionServer runs a variety of background threads:</para>
- <section
- xml:id="regionserver.arch.processes.compactsplit">
- <title>CompactSplitThread</title>
- <para>Checks for splits and handles minor compactions.</para>
- </section>
- <section
- xml:id="regionserver.arch.processes.majorcompact">
- <title>MajorCompactionChecker</title>
- <para>Checks for major compactions.</para>
- </section>
- <section
- xml:id="regionserver.arch.processes.memstore">
- <title>MemStoreFlusher</title>
- <para>Periodically flushes in-memory writes in the MemStore to StoreFiles.</para>
- </section>
- <section
- xml:id="regionserver.arch.processes.log">
- <title>LogRoller</title>
- <para>Periodically checks the RegionServer's WAL.</para>
- </section>
- </section>
-
- <section
- xml:id="coprocessors">
- <title>Coprocessors</title>
- <para>Coprocessors were added in 0.92. A thorough <link
- xlink:href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">Blog Overview
- of CoProcessors</link> has been posted. Documentation will eventually move to this reference
- guide, but the blog is the most current information available at this time. </para>
- </section>
-
- <section
- xml:id="block.cache">
- <title>Block Cache</title>
-
- <para>HBase provides two different BlockCache implementations: the default onheap
- LruBlockCache and BucketCache, which is (usually) offheap. This section
- discusses the benefits and drawbacks of each implementation, how to choose the appropriate
- option, and configuration options for each.</para>
-
- <note><title>Block Cache Reporting: UI</title>
- <para>See the RegionServer UI for detail on the caching deploy.
Since HBase-0.98.4, the
- Block Cache detail has been significantly extended, showing configurations,
- sizings, current usage, time-in-the-cache, and even detail on block counts and types.</para>
- </note>
-
- <section>
-
- <title>Cache Choices</title>
- <para><classname>LruBlockCache</classname> is the original implementation, and is
- entirely within the Java heap. <classname>BucketCache</classname> is mainly
- intended for keeping blockcache data offheap, although BucketCache can also
- keep data onheap and serve from a file-backed cache.
- <note><title>BucketCache is production ready as of hbase-0.98.6</title>
- <para>To run with BucketCache, you need HBASE-11678. This was included in
- hbase-0.98.6.
- </para>
- </note>
- </para>
-
- <para>Fetching from BucketCache will always be slower than fetching from
- the native onheap LruBlockCache. However, latencies tend to be
- less erratic across time, because there is less garbage collection when you use
- BucketCache: it manages the BlockCache allocations itself rather than leaving them to the GC. If the
- BucketCache is deployed in offheap mode, this memory is not managed by the
- GC at all. This is why you'd use BucketCache: your latencies are less erratic, and you mitigate GCs
- and heap fragmentation. See Nick Dimiduk's <link
- xlink:href="http://www.n10k.com/blog/blockcache-101/">BlockCache 101</link> for
- comparisons of onheap vs offheap runs. Also see
- <link xlink:href="http://people.apache.org/~stack/bc/">Comparing BlockCache Deploys</link>,
- which finds that if your dataset fits inside your LruBlockCache deploy, use it; otherwise,
- if you are experiencing cache churn (or you want your cache to exist beyond the
- vagaries of java GC), use BucketCache.
- </para>
-
- <para>When you enable BucketCache, you are enabling a two-tier caching
- system: an L1 cache implemented by an instance of LruBlockCache and
- an offheap L2 cache implemented by BucketCache.
Management of these
- two tiers and the policy that dictates how blocks move between them is done by
- <classname>CombinedBlockCache</classname>. It keeps all DATA blocks in the L2
- BucketCache and meta blocks -- INDEX and BLOOM blocks --
- onheap in the L1 <classname>LruBlockCache</classname>.
- See <xref linkend="offheap.blockcache" /> for more detail on going offheap.</para>
- </section>
-
- <section xml:id="cache.configurations">
- <title>General Cache Configurations</title>
- <para>Apart from the cache implementation itself, you can set some general configuration
- options to control how the cache performs. See <link
- xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html"
- />. After setting any of these options, restart or rolling restart your cluster for the
- configuration to take effect. Check logs for errors or unexpected behavior.</para>
- <para>See also <xref linkend="blockcache.prefetch"/>, which discusses a new option
- introduced in <link xlink:href="https://issues.apache.org/jira/browse/HBASE-9857"
- >HBASE-9857</link>.</para>
- </section>
-
- <section
- xml:id="block.cache.design">
- <title>LruBlockCache Design</title>
- <para>The LruBlockCache is an LRU cache that contains three levels of block priority to
- allow for scan-resistance and in-memory ColumnFamilies: </para>
- <itemizedlist>
- <listitem>
- <para>Single access priority: The first time a block is loaded from HDFS it normally
- has this priority and it will be part of the first group to be considered during
- evictions. The advantage is that scanned blocks are more likely to get evicted than
- blocks that are getting more usage.</para>
- </listitem>
- <listitem>
- <para>Multi access priority: If a block in the previous priority group is accessed
- again, it upgrades to this priority.
It is thus part of the second group considered
- during evictions.</para>
- </listitem>
- <listitem xml:id="hbase.cache.inmemory">
- <para>In-memory access priority: If the block's family was configured to be
- "in-memory", it will be part of this priority regardless of the number of times it
- was accessed. Catalog tables are configured like this. This group is the last one
- considered during evictions.</para>
- <para>To mark a column family as in-memory, call
- <programlisting language="java">HColumnDescriptor.setInMemory(true);</programlisting> if creating a table from java,
- or set <command>IN_MEMORY => true</command> when creating or altering a table in
- the shell: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 'f', IN_MEMORY => 'true'}</programlisting></para>
- </listitem>
- </itemizedlist>
- <para> For more information, see the <link
- xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html">LruBlockCache
- source</link>
- </para>
- </section>
- <section
- xml:id="block.cache.usage">
- <title>LruBlockCache Usage</title>
- <para>Block caching is enabled by default for all user tables, which means that any
- read operation will load the LRU cache. This might be good for a large number of use
- cases, but further tuning is usually required in order to achieve better performance.
- An important concept is the <link
- xlink:href="http://en.wikipedia.org/wiki/Working_set_size">working set size</link>, or
- WSS, which is: "the amount of memory needed to compute the answer to a problem". For a
- website, this would be the data that's needed to answer the queries over a short amount
- of time. </para>
- <para>The way to calculate how much memory is available in HBase for caching is: </para>
- <programlisting>
- number of region servers * heap size * hfile.block.cache.size * 0.99
- </programlisting>
- <para>The default value for the block cache is 0.25, which represents 25% of the available
- heap.
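The formula can be sanity-checked with a quick sketch (plain Java; <code>BlockCacheSizing</code> is purely illustrative, not an HBase class):

```java
// Purely illustrative: computes the block cache estimate
//   number of region servers * heap size * hfile.block.cache.size * 0.99
class BlockCacheSizing {
    static double blockCacheGb(int regionServers, double heapGb, double cacheFraction) {
        // 0.99 is the LRU's acceptable loading factor; eviction starts beyond it
        return regionServers * heapGb * cacheFraction * 0.99;
    }

    public static void main(String[] args) {
        // 20 region servers with 8 GB heaps and the default 0.25 cache fraction
        System.out.printf("%.1f GB%n", blockCacheGb(20, 8.0, 0.25)); // 39.6 GB
    }
}
```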
The last value (99%) is the default acceptable loading factor in the LRU cache,
- after which eviction is started. The reason it is included in this equation is that it
- would be unrealistic to say that it is possible to use 100% of the available memory,
- since this would make the process block from the point where it loads new blocks.
- Here are some examples: </para>
- <itemizedlist>
- <listitem>
- <para>One region server with the default heap size (1 GB) and the default block cache
- size will have 253 MB of block cache available.</para>
- </listitem>
- <listitem>
- <para>20 region servers with the heap size set to 8 GB and a default block cache size
- will have 39.6 GB of block cache.</para>
- </listitem>
- <listitem>
- <para>100 region servers with the heap size set to 24 GB and a block cache size of 0.5
- will have about 1.16 TB of block cache.</para>
- </listitem>
- </itemizedlist>
- <para>Your data is not the only resident of the block cache. Here are others that you may have to take into account:
- </para>
- <variablelist>
- <varlistentry>
- <term>Catalog Tables</term>
- <listitem>
- <para>The <code>-ROOT-</code> (prior to HBase 0.96. See <xref
- linkend="arch.catalog.root" />) and <code>hbase:meta</code> tables are forced
- into the block cache and have the in-memory priority, which means that they are
- harder to evict. The former never uses more than a few hundred bytes while the
- latter can occupy a few MBs (depending on the number of regions).</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>HFiles Indexes</term>
- <listitem>
- <para>An <firstterm>hfile</firstterm> is the file format that HBase uses to store
- data in HDFS. It contains a multi-layered index which allows HBase to seek to the
- data without having to read the whole file. The size of those indexes is a function
- of the block size (64KB by default), the size of your keys, and the amount of data
- you are storing.
For big data sets it's not unusual to see numbers around 1GB per
- region server, although not all of it will be in cache because the LRU will evict
- indexes that aren't used.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>Keys</term>
- <listitem>
- <para>The values that are stored are only half the picture, since each value is
- stored along with its keys (row key, family, qualifier, and timestamp). See <xref
- linkend="keysize" />.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>Bloom Filters</term>
- <listitem>
- <para>Just like the HFile indexes, those data structures (when enabled) are stored
- in the LRU.</para>
- </listitem>
- </varlistentry>
- </variablelist>
- <para>Currently the recommended way to measure HFile index and bloom filter sizes is to
- look at the region server web UI and check out the relevant metrics. For keys, sampling
- can be done by using the HFile command line tool and looking for the average key size
- metric. Since HBase 0.98.3, you can view detail on BlockCache stats and metrics
- in a special Block Cache section in the UI.</para>
- <para>It's generally bad to use block caching when the WSS doesn't fit in memory. This is
- the case when, for example, you have 40GB available across all your region servers' block
- caches but you need to process 1TB of data. One of the reasons is that the churn
- generated by the evictions will trigger more garbage collections unnecessarily. Here are
- two use cases: </para>
- <itemizedlist>
- <listitem>
- <para>Fully random reading pattern: This is a case where you almost never access the
- same row twice within a short amount of time, such that the chance of hitting a
- cached block is close to 0. Setting block caching on such a table is a waste of
- memory and CPU cycles, all the more so because it will generate more garbage for the
- JVM to pick up.
For more information on monitoring GC, see <xref
- linkend="trouble.log.gc" />.</para>
- </listitem>
- <listitem>
- <para>Mapping a table: In a typical MapReduce job that takes a table as input, every
- row will be read only once so there's no need to put them into the block cache. The
- Scan object has the option of turning this off via the setCacheBlocks method (set it to
- false). You can still keep block caching turned on for this table if you need fast
- random read access. An example would be counting the number of rows in a table that
- serves live traffic; caching every block of that table would create massive churn
- and would surely evict data that's currently in use. </para>
- </listitem>
- </itemizedlist>
- <section xml:id="data.blocks.in.fscache">
- <title>Caching META blocks only (DATA blocks in fscache)</title>
- <para>An interesting setup is one where we cache META blocks only and we read DATA
- blocks in on each access. If the DATA blocks fit inside fscache, this alternative
- may make sense when access is completely random across a very large dataset.
- To enable this setup, alter your table and for each column family
- set <varname>BLOCKCACHE => 'false'</varname>. You are 'disabling' the
- BlockCache for this column family only; you can never disable the caching of
- META blocks. Since
- <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4683">HBASE-4683 Always cache index and bloom blocks</link>,
- we will cache META blocks even if the BlockCache is disabled.
- </para>
- </section>
- </section>
- <section
- xml:id="offheap.blockcache">
- <title>Offheap Block Cache</title>
- <section xml:id="enable.bucketcache">
- <title>How to Enable BucketCache</title>
- <para>The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache
- implemented by LruBlockCache and a second L2 cache implemented with BucketCache.
The managing class is <link
- xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html">CombinedBlockCache</link> by default.
- The just-previous link describes the caching 'policy' implemented by CombinedBlockCache. In short, it works
- by keeping meta blocks -- INDEX and BLOOM -- in the L1, onheap LruBlockCache tier, while DATA
- blocks are kept in the L2, BucketCache tier. Since HBase version 1.0, it is possible to amend this
- behavior and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier by
- setting <varname>cacheDataInL1</varname> via
- <code>HColumnDescriptor.setCacheDataInL1(true)</code>
- or in the shell, creating or amending column families setting <varname>CACHE_DATA_IN_L1</varname>
- to true: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}</programlisting></para>
-
- <para>The BucketCache Block Cache can be deployed onheap, offheap, or file based.
- You set which via the
- <varname>hbase.bucketcache.ioengine</varname> setting. Setting it to
- <varname>heap</varname> will have BucketCache deployed inside the
- allocated java heap. Setting it to <varname>offheap</varname> will have
- BucketCache make its allocations offheap,
- and an ioengine setting of <varname>file:PATH_TO_FILE</varname> will direct
- BucketCache to use file caching (useful in particular if you have some fast I/O attached to the box, such
- as SSDs).
- </para>
- <para xml:id="raw.l1.l2">It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache
- policy and have BucketCache working as a strict L2 cache to the L1
- LruBlockCache. For such a setup, set <varname>CacheConfig.BUCKET_CACHE_COMBINED_KEY</varname> to
- <literal>false</literal>. In this mode, on eviction from L1, blocks go to L2.
- When a block is cached, it is cached first in L1.
When we go to look for a cached block,
- we look first in L1 and, if none is found, then search L2. Let us call this deploy format
- <emphasis><indexterm><primary>Raw L1+L2</primary></indexterm></emphasis>.</para>
- <para>Other BucketCache configs include: specifying a location to persist the cache to across
- restarts, how many threads to use writing the cache, etc. See the
- <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html">CacheConfig.html</link>
- class for configuration options and descriptions.</para>
-
- <procedure>
- <title>BucketCache Example Configuration</title>
- <para>This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB
- onheap cache. Configuration is performed on the RegionServer. Setting
- <varname>hbase.bucketcache.ioengine</varname> and
- <varname>hbase.bucketcache.size</varname> > 0 enables CombinedBlockCache.
- Let us presume that the RegionServer has been set to run with a 5G heap:
- i.e. HBASE_HEAPSIZE=5g.
- </para>
- <step>
- <para>First, edit the RegionServer's <filename>hbase-env.sh</filename> and set
- <varname>HBASE_OFFHEAPSIZE</varname> to a value greater than the offheap size wanted, in
- this case, 4 GB (expressed as 4G). Let's set it to 5G. That'll be 4G
- for our offheap cache and 1G for any other uses of offheap memory (there are
- users of offheap memory other than the BlockCache; e.g. the DFSClient
- in the RegionServer can make use of offheap memory).
See <xref linkend="direct.memory" />.</para>
- <programlisting>HBASE_OFFHEAPSIZE=5G</programlisting>
- </step>
- <step>
- <para>Next, add the following configuration to the RegionServer's
- <filename>hbase-site.xml</filename>.</para>
- <programlisting language="xml">
-<![CDATA[<property>
- <name>hbase.bucketcache.ioengine</name>
- <value>offheap</value>
-</property>
-<property>
- <name>hfile.block.cache.size</name>
- <value>0.2</value>
-</property>
-<property>
- <name>hbase.bucketcache.size</name>
- <value>4096</value>
-</property>]]>
- </programlisting>
- </step>
- <step>
- <para>Restart or rolling restart your cluster, and check the logs for any
- issues.</para>
- </step>
- </procedure>
- <para>In the above, we set the bucketcache to be 4G. The onheap lrublockcache we
- configured to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G).
- In other words, you configure the L1 LruBlockCache as you would normally,
- as you would when there is no L2 BucketCache present.
- </para>
- <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-10641"
- >HBASE-10641</link> introduced the ability to configure multiple sizes for the
- buckets of the bucketcache, in HBase 0.98 and newer. To configure multiple bucket
- sizes, configure the new property <option>hfile.block.cache.sizes</option> (instead of
- <option>hfile.block.cache.size</option>) to a comma-separated list of block sizes,
- ordered from smallest to largest, with no spaces. The goal is to optimize the bucket
- sizes based on your data access patterns. The following example configures buckets of
- size 4096 and 8192.</para>
- <screen language="xml"><![CDATA[
-<property>
- <name>hfile.block.cache.sizes</name>
- <value>4096,8192</value>
-</property>
- ]]></screen>
- <note xml:id="direct.memory">
- <title>Direct Memory Usage In HBase</title>
- <para>The default maximum direct memory varies by JVM.
Traditionally it is 64M
- or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently).
- HBase servers use direct memory; in particular, with short-circuit reading, the hosted DFSClient will
- allocate direct memory buffers. If you do offheap block caching, you'll
- be making use of direct memory. Starting your JVM, make sure
- the <varname>-XX:MaxDirectMemorySize</varname> setting in
- <filename>conf/hbase-env.sh</filename> is set to some value that is
- higher than what you have allocated to your offheap blockcache
- (<varname>hbase.bucketcache.size</varname>). It should be larger than your offheap block
- cache and then some for DFSClient usage (how much the DFSClient uses is not
- easy to quantify; it is the number of open hfiles * <varname>hbase.dfs.client.read.shortcircuit.buffer.size</varname>,
- where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see <filename>hbase-default.xml</filename>
- default configurations).
- Direct memory, which is part of the Java process memory, is separate from the object
- heap allocated by -Xmx. The value allocated by MaxDirectMemorySize must not exceed
- physical RAM, and is likely to be less than the total available RAM due to other
- memory requirements and system constraints.
- </para>
- <para>You can see how much memory -- onheap and offheap/direct -- a RegionServer is
- configured to use and how much it is using at any one time by looking at the
- <emphasis>Server Metrics: Memory</emphasis> tab in the UI. It can also be gotten
- via JMX. In particular, the direct memory currently used by the server can be found
- on the <varname>java.nio.type=BufferPool,name=direct</varname> bean. Terracotta has
- a <link
- xlink:href="http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options"
- >good write up</link> on using offheap memory in java. It is for their product
- BigMemory but a lot of the issues noted apply in general to any attempt at going
- offheap.
Check it out.</para>
- </note>
- <note xml:id="hbase.bucketcache.percentage.in.combinedcache"><title>hbase.bucketcache.percentage.in.combinedcache</title>
- <para>This is a pre-HBase 1.0 configuration removed because it
- was confusing. It was a float that you would set to some value
- between 0.0 and 1.0. Its default was 0.9. If the deploy was using
- CombinedBlockCache, then the LruBlockCache L1 size was calculated to
- be (1 - <varname>hbase.bucketcache.percentage.in.combinedcache</varname>) * <varname>size-of-bucketcache</varname>
- and the BucketCache size was <varname>hbase.bucketcache.percentage.in.combinedcache</varname> * size-of-bucket-cache,
- where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size
- IF it was specified as megabytes OR <varname>hbase.bucketcache.size</varname> * <varname>-XX:MaxDirectMemorySize</varname> if
- <varname>hbase.bucketcache.size</varname> is between 0 and 1.0.
- </para>
- <para>In 1.0, it should be more straightforward. The L1 LruBlockCache size
- is set as a fraction of the java heap using the hfile.block.cache.size setting
- (not the best name) and L2 is set as above, either in absolute
- megabytes or as a fraction of allocated maximum direct memory.
- </para>
- </note>
- </section>
- </section>
- <section>
- <title>Compressed BlockCache</title>
- <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-11331"
- >HBASE-11331</link> introduced lazy blockcache decompression, more simply referred to
- as compressed blockcache. When compressed blockcache is enabled,
data and encoded data
- blocks are cached in the blockcache in their on-disk format, rather than being
- decompressed and decrypted before caching.</para>
- <para xlink:href="https://issues.apache.org/jira/browse/HBASE-11331">For a RegionServer
- hosting more data than can fit into cache, enabling this feature with SNAPPY compression
- has been shown to result in a 50% increase in throughput and a 30% improvement in mean
- latency, while increasing garbage collection by 80% and increasing overall CPU load by
- 2%. See HBASE-11331 for more details about how performance was measured and achieved.
- For a RegionServer hosting data that can comfortably fit into cache, or if your workload
- is sensitive to extra CPU or garbage-collection load, you may receive less
- benefit.</para>
- <para>Compressed blockcache is disabled by default. To enable it, set
- <code>hbase.block.data.cachecompressed</code> to <code>true</code> in
- <filename>hbase-site.xml</filename> on all RegionServers.</para>
- </section>
- </section>
-
- <section
- xml:id="wal">
- <title>Write Ahead Log (WAL)</title>
-
- <section
- xml:id="purpose.wal">
- <title>Purpose</title>
- <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all changes to data in
- HBase to file-based storage. Under normal operations, the WAL is not needed because
- data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or
- becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to
- the data can be replayed. If writing to the WAL fails, the entire operation to modify the
- data fails.</para>
- <para>
- HBase uses an implementation of the <link xlink:href=
- "http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html"
- >WAL</link> interface. Usually, there is only one instance of a WAL per RegionServer.
- The RegionServer records Puts and Deletes to it, before recording them to the <xref
- linkend="store.memstore" /> for the affected <xref
- linkend="store" />.
- </para>
- <note>
- <title>The HLog</title>
- <para>
- Prior to 2.0, the interface for WALs in HBase was named <classname>HLog</classname>.
- In 0.94, HLog was the name of the implementation of the WAL. You will likely find
- references to the HLog in documentation tailored to these older versions.
- </para>
- </note>
- <para>The WAL resides in HDFS in the <filename>/hbase/WALs/</filename> directory (prior to
- HBase 0.94, they were stored in <filename>/hbase/.logs/</filename>), with subdirectories per
- RegionServer.</para>
- <para> For more general information about the concept of write ahead logs, see the
- Wikipedia <link
- xlink:href="http://en.wikipedia.org/wiki/Write-ahead_logging">Write-Ahead Log</link>
- article. </para>
- </section>
- <section
- xml:id="wal_flush">
- <title>WAL Flushing</title>
- <para>TODO (describe). </para>
- </section>
-
- <section
- xml:id="wal_splitting">
- <title>WAL Splitting</title>
-
- <para>A RegionServer serves many regions. All of the regions in a region server share the
- same active WAL file. Each edit in the WAL file includes information about which region
- it belongs to. When a region is opened, the edits in the WAL file which belong to that
- region need to be replayed. Therefore, edits in the WAL file must be grouped by region
- so that particular sets can be replayed to regenerate the data in a particular region.
- The process of grouping the WAL edits by region is called <firstterm>log
- splitting</firstterm>. It is a critical process for recovering data if a region server
- fails.</para>
- <para>Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler
- as a region server shuts down. To guarantee consistency, affected regions
- are unavailable until data is restored.
All WAL edits need to be recovered and replayed
- before a given region can become available again. As a result, regions affected by
- log splitting are unavailable until the process completes.</para>
- <procedure xml:id="log.splitting.step.by.step">
- <title>Log Splitting, Step by Step</title>
- <step>
- <title>The <filename>/hbase/WALs/<host>,<port>,<startcode></filename> directory is renamed.</title>
- <para>Renaming the directory is important because a RegionServer may still be up and
- accepting requests even if the HMaster thinks it is down. If the RegionServer does
- not respond immediately and does not heartbeat its ZooKeeper session, the HMaster
- may interpret this as a RegionServer failure. Renaming the logs directory ensures
- that existing, valid WAL files which are still in use by an active but busy
- RegionServer are not written to by accident.</para>
- <para>The new directory is named according to the following pattern:</para>
- <screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
- <para>An example of such a renamed directory might look like the following:</para>
- <screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
- </step>
- <step>
- <title>Each log file is split, one at a time.</title>
- <para>The log splitter reads the log file one edit entry at a time and puts each edit
- entry into the buffer corresponding to the edit's region. At the same time, the
- splitter starts several writer threads. Writer threads pick up a corresponding
- buffer and write the edit entries in the buffer to a temporary recovered edit
- file. The temporary edit file is stored to disk with the following naming pattern:</para>
- <screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
- <para>This file is used to store all the edits in the WAL log for this region.
After
- log splitting completes, the <filename>.temp</filename> file is renamed to the
- sequence ID of the first log written to the file.</para>
- <para>To determine whether all edits have been written, the sequence ID is compared to
- the sequence of the last edit that was written to the HFile. If the sequence of the
- last edit is greater than or equal to the sequence ID included in the file name, it
- is clear that all writes from the edit file have been completed.</para>
- </step>
- <step>
- <title>After log splitting is complete, each affected region is assigned to a
- RegionServer.</title>
- <para> When the region is opened, the <filename>recovered.edits</filename> folder is checked for recovered
- edits files. If any such files are present, they are replayed by reading the edits
- and saving them to the MemStore. After all edit files are replayed, the contents of
- the MemStore are written to disk (HFile) and the edit files are deleted.</para>
- </step>
- </procedure>
-
- <section>
- <title>Handling of Errors During Log Splitting</title>
-
- <para>If you set the <varname>hbase.hlog.split.skip.errors</varname> option to
- <constant>true</constant>, errors are treated as follows:</para>
- <itemizedlist>
- <listitem>
- <para>Any error encountered during splitting will be logged.</para>
- </listitem>
- <listitem>
- <para>The problematic WAL log will be moved into the <filename>.corrupt</filename>
- directory under the hbase <varname>rootdir</varname>.</para>
- </listitem>
- <listitem>
- <para>Processing of the WAL will continue.</para>
- </listitem>
- </itemizedlist>
- <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
- <literal>false</literal>, the default, the exception will be propagated and the
- split will be logged as failed. See <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-2958">HBASE-2958 When
- hbase.hlog.split.skip.errors is set to false, we fail the split but thats
- it</link>.
We need to do more than just fail the split if this flag is set.</para>
-
- <section>
- <title>How EOFExceptions are treated when splitting a crashed RegionServer's
- WALs</title>
-
- <para>If an EOFException occurs while splitting logs, the split proceeds even when
- <varname>hbase.hlog.split.skip.errors</varname> is set to
- <literal>false</literal>. An EOFException while reading the last log in the set of
- files to split is likely, because the RegionServer is likely to have been in the process of
- writing a record at the time of the crash. For background, see <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
- Figure how to deal with eof splitting logs</link></para>
- </section>
- </section>
-
- <section>
- <title>Performance Improvements during Log Splitting</title>
- <para>
- WAL log splitting and recovery can be resource intensive and take a long time,
- depending on the number of RegionServers involved in the crash and the size of the
- regions. <xref linkend="distributed.log.splitting" /> and <xref
- linkend="distributed.log.replay" /> were developed to improve
- performance during log splitting.
- </para>
- <section xml:id="distributed.log.splitting">
- <title>Distributed Log Splitting</title>
- <para><firstterm>Distributed Log Splitting</firstterm> was added in HBase version 0.92
- (<link xlink:href="https://issues.apache.org/jira/browse/HBASE-1364">HBASE-1364</link>)
- by Prakash Khemani from Facebook. It reduces the time to complete log splitting
- dramatically, improving the availability of regions and tables.
For
- example, recovering a crashed cluster took around 9 hours with single-threaded log
- splitting, but only about 6 minutes with distributed log splitting.</para>
- <para>The information in this section is sourced from Jimmy Xiang's blog post at <link
- xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/" />.</para>
-
- <formalpara>
- <title>Enabling or Disabling Distributed Log Splitting</title>
- <para>Distributed log processing has been enabled by default since HBase 0.92. The setting
- is controlled by the <property>hbase.master.distributed.log.splitting</property>
- property, which can be set to <literal>true</literal> or <literal>false</literal>,
- but defaults to <literal>true</literal>. </para>
- </formalpara>
- <procedure>
- <title>Distributed Log Splitting, Step by Step</title>
- <para>After configuring distributed log splitting, the HMaster controls the process.
- The HMaster enrolls each RegionServer in the log splitting process, and the actual
- work of splitting the logs is done by the RegionServers. The general process for
- log splitting, as described in <xref
- linkend="log.splitting.step.by.step" />, still applies here.</para>
- <step>
- <para>If distributed log processing is enabled, the HMaster creates a
- <firstterm>split log manager</firstterm> instance when the cluster is started.
- The split log manager manages all log files which need
- to be scanned and split. The split log manager places all the logs into the
- ZooKeeper splitlog node (<filename>/hbase/splitlog</filename>) as tasks. You can
- view the contents of the splitlog by issuing the following
- <command>zkcli</command> command.
Example output is shown.</para>
- <screen language="bourne">ls /hbase/splitlog
-[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
-hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
-hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
- </screen>
- <para>The output contains some percent-encoded characters. When decoded, it looks much
- simpler:</para>
- <screen>
-[hdfs://host2.sample.com:56020/hbase/.logs
-/host8.sample.com,57020,1340474893275-splitting
-/host8.sample.com%3A57020.1340474893900,
-hdfs://host2.sample.com:56020/hbase/.logs
-/host3.sample.com,57020,1340474893299-splitting
-/host3.sample.com%3A57020.1340474893931,
-hdfs://host2.sample.com:56020/hbase/.logs
-/host4.sample.com,57020,1340474893287-splitting
-/host4.sample.com%3A57020.1340474893946]
- </screen>
- <para>The listing represents WAL file names to be scanned and split, which is a
- list of log splitting tasks.</para>
- </step>
- <step>
- <title>The split log manager monitors the log-splitting tasks and workers.</title>
- <para>The split log manager is responsible for the following ongoing tasks:</para>
- <itemizedlist>
- <listitem>
- <para>Once the split log manager publishes all the tasks to the splitlog
- znode, it monitors these task nodes and waits for them to be
- processed.</para>
- </listitem>
- <listitem>
- <para>Checks to see if there are any dead split log
- workers queued up. If it finds tasks claimed by unresponsive workers, it
- will resubmit those tasks. If the resubmit fails due to some ZooKeeper
- exception, the dead worker is queued up again for retry.</para>
- </listitem>
- <listitem>
- <para>Checks to see if there are any unassigned
- tasks.
If it finds any, it creates an ephemeral rescan node so that each - split log worker is notified to re-scan unassigned tasks via the - <code>nodeChildrenChanged</code> ZooKeeper event.</para> - </listitem> - <listitem> - <para>Checks for tasks which are assigned but expired. If any are found, they - are moved back to <code>TASK_UNASSIGNED</code> state so that they can - be retried. It is possible that these tasks are assigned to slow workers, or - they may already be finished. This is not a problem, because log splitting - tasks are idempotent. In other words, the same log - splitting task can be processed many times without causing any - problem.</para> - </listitem> - <listitem> - <para>The split log manager watches the HBase split log znodes constantly. If - any split log task node data is changed, the split log manager retrieves the - node data. The - node data contains the current state of the task. You can use the - <command>zkcli</command> <command>get</command> command to retrieve the - current state of a task. 
In the example output below, the first line of the - output shows that the task is currently unassigned.</para> - <screen> -<userinput>get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945 -</userinput> -<computeroutput>unassigned host2.sample.com:57000 -cZxid = 0x7115 -ctime = Sat Jun 23 11:13:40 PDT 2012 -...</computeroutput> - </screen> - <para>Based on the state of the task whose data is changed, the split log - manager does one of the following:</para> - - <itemizedlist> - <listitem> - <para>Resubmit the task if it is unassigned</para> - </listitem> - <listitem> - <para>Heartbeat the task if it is assigned</para> - </listitem> - <listitem> - <para>Resubmit or fail the task if it is resigned (see <xref - linkend="distributed.log.replay.failure.reasons" />)</para> - </listitem> - <listitem> - <para>Resubmit or fail the task if it is completed with errors (see <xref - linkend="distributed.log.replay.failure.reasons" />)</para> - </listitem> - <listitem> - <para>Resubmit or fail the task if it could not complete due to - errors (see <xref - linkend="distributed.log.replay.failure.reasons" />)</para> - </listitem> - <listitem> - <para>Delete the task if it is successfully completed or failed</para> - </listitem> - </itemizedlist> - <itemizedlist xml:id="distributed.log.replay.failure.reasons"> - <title>Reasons a Task Will Fail</title> - <listitem><para>The task has been deleted.</para></listitem> - <listitem><para>The node no longer exists.</para></listitem> - <listitem><para>The log status manager failed to move the state of the task - to TASK_UNASSIGNED.</para></listitem> - <listitem><para>The number of resubmits is over the resubmit - threshold.</para></listitem> - </itemizedlist> - </listitem> - </itemizedlist> - </step> - <step> - <title>Each RegionServer's split log worker performs the log-splitting tasks.</title> - <para>Each RegionServer runs 
a daemon thread called the <firstterm>split log - worker</firstterm>, which does the work to split the logs. The daemon thread - starts when the RegionServer starts, and registers itself to watch HBase znodes. - If any splitlog znode children change, it notifies a sleeping worker thread to - wake up and grab more tasks. If a worker's current task's node data is - changed, the worker checks to see if the task has been taken by another worker. - If so, the worker thread stops work on the current task.</para> - <para>The worker monitors - the splitlog znode constantly. When a new task appears, the split log worker - retrieves the task paths and checks each one until it finds an unclaimed task, - which it attempts to claim. If the claim was successful, it attempts to perform - the task and updates the task's <property>state</property> property based on the - splitting outcome. At this point, the split log worker scans for another - unclaimed task.</para> - <itemizedlist> - <title>How the Split Log Worker Approaches a Task</title> - - <listitem> - <para>It queries the task state and only takes action if the task is in - <literal>TASK_UNASSIGNED</literal> state.</para> - </listitem> - <listitem> - <para>If the task is in <literal>TASK_UNASSIGNED</literal> state, the - worker attempts to set the state to <literal>TASK_OWNED</literal> by itself. - If it fails to set the state, another worker will try to grab it. The split - log manager will also ask all workers to rescan later if the task remains - unassigned.</para> - </listitem> - <listitem> - <para>If the worker succeeds in taking ownership of the task, it fetches the - task state again to confirm that it really owns it, because the state is set - asynchronously. 
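</para>
<para>The ownership hand-off above relies on ZooKeeper's versioned, atomic
writes. The following is a minimal sketch of that compare-and-set claim, using
a simplified in-memory stand-in for a splitlog znode; the class and names are
invented for illustration and are not HBase's actual implementation.</para>

```python
# Simplified in-memory model of a splitlog znode (illustrative only, not
# HBase code). ZooKeeper keeps a per-znode version, so a conditional write
# fails if another client wrote to the node since the caller last read it.
class SplitLogTask:
    def __init__(self):
        self.state = "TASK_UNASSIGNED"
        self.owner = None
        self.version = 0  # per-znode version used for the conditional write

    def set_state(self, new_state, owner, expected_version):
        """Compare-and-set write: fails if anyone wrote since the caller read."""
        if self.version != expected_version:
            return False  # another worker changed the node first
        self.state, self.owner = new_state, owner
        self.version += 1
        return True

task = SplitLogTask()
seen = task.version  # two workers both read the task while it is unassigned
assert task.set_state("TASK_OWNED", "host3.sample.com,57020", seen)      # wins
assert not task.set_state("TASK_OWNED", "host4.sample.com,57020", seen)  # loses
assert task.owner == "host3.sample.com,57020"
```

<para>Only one worker's write succeeds; every other worker sees a version
mismatch and goes back to scanning for unclaimed tasks.</para>
<para>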
In the - meantime, it starts a split task executor to do the actual work: </para> - <itemizedlist> - <listitem> - <para>Get the HBase root folder, create a temp folder under the root, and - split the log file to the temp folder.</para> - </listitem> - <listitem> - <para>If the split was successful, the task executor sets the task to - state <literal>TASK_DONE</literal>.</para> - </listitem> - <listitem> - <para>If the worker catches an unexpected IOException, the task is set to - state <literal>TASK_ERR</literal>.</para> - </listitem> - <listitem> - <para>If the worker is shutting down, it sets the task to state - <literal>TASK_RESIGNED</literal>.</para> - </listitem> - <listitem> - <para>If the task is taken by another worker, the worker simply logs it.</para> - </listitem> - </itemizedlist> - </listitem> - </itemizedlist> - </step> - <step> - <title>The split log manager monitors for uncompleted tasks.</title> - <para>The split log manager returns when all tasks are completed successfully. If - all tasks are completed but some of them failed, the split log manager throws an - exception so that the log splitting can be retried. Due to an asynchronous - implementation, in very rare cases, the split log manager loses track of some - completed tasks. For that reason, it periodically checks for remaining - uncompleted tasks in its task map or ZooKeeper. If none are found, it throws an - exception so that the log splitting can be retried right away instead of hanging - there waiting for something that won't happen.</para> - </step> - </procedure> - </section> - <section xml:id="distributed.log.replay"> - <title>Distributed Log Replay</title> - <para>After a RegionServer fails, its failed regions are assigned to other - RegionServers, which are marked as "recovering" in ZooKeeper. A split log worker directly - replays edits from the WAL of the failed region server to the region at its new - location. 
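</para>
<para>The contrast with classic log splitting can be sketched as follows. This
is an illustrative Python model only, not HBase code; the function names and
data shapes are invented for the example.</para>

```python
# Classic splitting materializes per-region recovered.edits files that the
# re-opened region must read back later; distributed log replay applies each
# WAL edit straight to the region at its new location. (Illustrative model.)

def split_to_recovered_edits(wal_edits, filesystem):
    # Distributed log splitting: write intermediate files, replayed later.
    for region, edit in wal_edits:
        filesystem.setdefault(region + "/recovered.edits", []).append(edit)

def replay_directly(wal_edits, regions):
    # Distributed log replay: no intermediate files, apply edits immediately.
    for region, edit in wal_edits:
        regions[region].append(edit)

wal_edits = [("region-a", "put row1"), ("region-b", "put row9"),
             ("region-a", "put row2")]

fs = {}
split_to_recovered_edits(wal_edits, fs)
assert fs["region-a/recovered.edits"] == ["put row1", "put row2"]

regions = {"region-a": [], "region-b": []}
replay_directly(wal_edits, regions)
assert regions["region-a"] == ["put row1", "put row2"]
```

<para>Both paths apply the same edits in the same per-region order; replay
simply skips the intermediate files.</para>
<para>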
When a region is in "recovering" state, it can continue to accept writes, but no reads - (including Appends and Increments), region splits, or merges are allowed. </para> - <para>Distributed Log Replay extends the <xref linkend="distributed.log.splitting" /> framework. It works by - directly replaying WAL edits to another RegionServer instead of creating - <filename>recovered.edits</filename> files. It provides the following advantages - over distributed log splitting alone:</para> - <itemizedlist> - <listitem><para>It eliminates the overhead of writing and reading a large number of - <filename>recovered.edits</filename> files. It is not unusual for thousands of - <filename>recovered.edits</filename> files to be created and written concurrently - during a RegionServer recovery. Many small random writes can degrade overall - system performance.</para></listitem> - <listitem><para>It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again. -</para></listitem> - </itemizedlist> - <formalpara> - <title>Enabling Distributed Log Replay</title> - <para>To enable distributed log replay, set <varname>hbase.master.distributed.log.replay</varname> to - <literal>true</literal>. This will be the default for HBase 0.99 (<link - xlink:href="https://issues.apache.org/jira/browse/HBASE-10888">HBASE-10888</link>).</para> - </formalpara> - <para>You must also enable HFile version 3 (which is the default HFile format starting - in HBase 0.99. See <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-10855">HBASE-
<TRUNCATED>
