http://git-wip-us.apache.org/repos/asf/hbase/blob/5fbf80ee/src/main/asciidoc/_chapters/case_studies.adoc ---------------------------------------------------------------------- diff --git a/src/main/asciidoc/_chapters/case_studies.adoc b/src/main/asciidoc/_chapters/case_studies.adoc index f6a43ee..3746c2a 100644 --- a/src/main/asciidoc/_chapters/case_studies.adoc +++ b/src/main/asciidoc/_chapters/case_studies.adoc @@ -84,16 +84,16 @@ If system memory is heavily overcommitted, the Linux kernel may enter a vicious Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error. This can manifest as high iowait, as running processes wait for reads and writes to complete. Finally, a disk nearing the upper edge of its performance envelope will begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory. -However, using [code]+vmstat(1)+ and [code]+free(1)+, we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second. +However, using `vmstat(1)` and `free(1)`, we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second. ===== Slowness Due To High Processor Usage -Next, we checked to see whether the system was performing slowly simply due to very high computational load. [code]+top(1)+ showed that the system load was higher than normal, but [code]+vmstat(1)+ and [code]+mpstat(1)+ showed that the amount of processor being used for actual computation was low. +Next, we checked to see whether the system was performing slowly simply due to very high computational load. `top(1)` showed that the system load was higher than normal, but `vmstat(1)` and `mpstat(1)` showed that the amount of processor being used for actual computation was low. ===== Network Saturation (The Winner) Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces. -The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. [code]+ifconfig(8)+ showed some unusual anomalies, namely interface errors, overruns, framing errors. +The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. `ifconfig(8)` showed some unusual anomalies, namely interface errors, overruns, framing errors. While not unheard of, these kinds of errors are exceedingly rare on modern hardware which is operating as it should: ---- @@ -109,7 +109,7 @@ RX bytes:2416328868676 (2.4 TB) TX bytes:3464991094001 (3.4 TB) ---- These errors immediately lead us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed. -This was confirmed both by running an ICMP ping from an external host and observing round-trip-time in excess of 700ms, and by running [code]+ethtool(8)+ on the members of the bond interface and discovering that the active interface was operating at 100Mbs/, full duplex. +This was confirmed both by running an ICMP ping from an external host and observing round-trip-time in excess of 700ms, and by running `ethtool(8)` on the members of the bond interface and discovering that the active interface was operating at 100Mbs/, full duplex. 
---- @@ -145,7 +145,7 @@ In normal operation, the ICMP ping round trip time should be around 20ms, and th ==== Resolution -After determining that the active ethernet adapter was at the incorrect speed, we used the [code]+ifenslave(8)+ command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance, and a 10 times improvement in network throughput: +After determining that the active ethernet adapter was at the incorrect speed, we used the `ifenslave(8)` command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance, and a 10 times improvement in network throughput: On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced. @@ -163,6 +163,6 @@ Although this research is on an older version of the codebase, this writeup is s [[casestudies.max.transfer.threads]] === Case Study #4 (max.transfer.threads Config) -Case study of configuring [code]+max.transfer.threads+ (previously known as [code]+xcievers+) and diagnosing errors from misconfigurations. link:http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html +Case study of configuring `max.transfer.threads` (previously known as `xcievers`) and diagnosing errors from misconfigurations. link:http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html See also <<dfs.datanode.max.transfer.threads,dfs.datanode.max.transfer.threads>>.
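The case study above walks through diagnosing a mis-negotiated bond member and failing over to the standby interface. A minimal sketch of those steps, assuming a bond named `bond0` with members `eth0` and `eth1` (placeholder names; substitute the members of your own bond), might look like:

[source,bash]
----
# Check negotiated speed and duplex on each member of the bond
ethtool eth0 | grep -E 'Speed|Duplex'
ethtool eth1 | grep -E 'Speed|Duplex'

# Look for the error/overrun/frame counters mentioned in the case study
ifconfig bond0

# Make the standby member the active slave, as described in the Resolution section
ifenslave -c bond0 eth1
----

As the case study concludes, a permanent fix (here, replacing the bad cable) is still required; changing the active slave only restores throughput until the faulty link is repaired.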
http://git-wip-us.apache.org/repos/asf/hbase/blob/5fbf80ee/src/main/asciidoc/_chapters/compression.adoc ---------------------------------------------------------------------- diff --git a/src/main/asciidoc/_chapters/compression.adoc b/src/main/asciidoc/_chapters/compression.adoc index 75a050f..42d4de5 100644 --- a/src/main/asciidoc/_chapters/compression.adoc +++ b/src/main/asciidoc/_chapters/compression.adoc @@ -65,12 +65,12 @@ To enable data block encoding for a ColumnFamily, see <<data.block.encoding.enab .Data Block Encoding Types Prefix:: - Often, keys are very similar. Specifically, keys often share a common prefix and only differ near the end. For instance, one key might be [literal]+RowKey:Family:Qualifier0+ and the next key might be [literal]+RowKey:Family:Qualifier1+. + Often, keys are very similar. Specifically, keys often share a common prefix and only differ near the end. For instance, one key might be `RowKey:Family:Qualifier0` and the next key might be `RowKey:Family:Qualifier1`. + In Prefix encoding, an extra column is added which holds the length of the prefix shared between the current key and the previous key. Assuming the first key here is totally different from the key before, its prefix length is 0. + -The second key's prefix length is [literal]+23+, since they have the first 23 characters in common. +The second key's prefix length is `23`, since they have the first 23 characters in common. + Obviously if the keys tend to have nothing in common, Prefix will not provide much benefit. + @@ -168,10 +168,10 @@ bzip2: false Above shows that the native hadoop library is not available in HBase context. To fix the above, either copy the Hadoop native libraries local or symlink to them if the Hadoop and HBase stalls are adjacent in the filesystem. -You could also point at their location by setting the [var]+LD_LIBRARY_PATH+ environment variable. +You could also point at their location by setting the `LD_LIBRARY_PATH` environment variable. -Where the JVM looks to find native librarys is "system dependent" (See [class]+java.lang.System#loadLibrary(name)+). On linux, by default, is going to look in [path]_lib/native/PLATFORM_ where [var]+PLATFORM+ is the label for the platform your HBase is installed on. -On a local linux machine, it seems to be the concatenation of the java properties [var]+os.name+ and [var]+os.arch+ followed by whether 32 or 64 bit. +Where the JVM looks to find native librarys is "system dependent" (See `java.lang.System#loadLibrary(name)`). On linux, by default, is going to look in _lib/native/PLATFORM_ where `PLATFORM` is the label for the platform your HBase is installed on. +On a local linux machine, it seems to be the concatenation of the java properties `os.name` and `os.arch` followed by whether 32 or 64 bit. HBase on startup prints out all of the java system properties so find the os.name and os.arch in the log. For example: [source] @@ -181,12 +181,12 @@ For example: 2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64 ... ---- -So in this case, the PLATFORM string is [var]+Linux-amd64-64+. -Copying the Hadoop native libraries or symlinking at [path]_lib/native/Linux-amd64-64_ will ensure they are found. -Check with the Hadoop [path]_NativeLibraryChecker_. +So in this case, the PLATFORM string is `Linux-amd64-64`. +Copying the Hadoop native libraries or symlinking at _lib/native/Linux-amd64-64_ will ensure they are found. +Check with the Hadoop _NativeLibraryChecker_. 
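As a concrete sketch of the copy-or-symlink approach described above, assuming Hadoop and HBase are installed side by side under `/usr/local` and that the PLATFORM string is `Linux-amd64-64` (both assumptions; adjust for your own layout):

[source,bash]
----
# Assumed install locations; adjust to match your environment
HADOOP_HOME=/usr/local/hadoop
HBASE_HOME=/usr/local/hbase

# Symlink the Hadoop native libraries where the HBase JVM will look for them
mkdir -p "$HBASE_HOME/lib/native"
ln -s "$HADOOP_HOME/lib/native" "$HBASE_HOME/lib/native/Linux-amd64-64"

# Verify that the natives are now picked up from the HBase context
cd "$HBASE_HOME"
./bin/hbase org.apache.hadoop.util.NativeLibraryChecker
----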
-Here is example of how to point at the Hadoop libs with [var]+LD_LIBRARY_PATH+ environment variable: +Here is example of how to point at the Hadoop libs with `LD_LIBRARY_PATH` environment variable: [source] ---- $ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker @@ -199,7 +199,7 @@ snappy: true /usr/lib64/libsnappy.so.1 lz4: true revision:99 bzip2: true /lib64/libbz2.so.1 ---- -Set in [path]_hbase-env.sh_ the LD_LIBRARY_PATH environment variable when starting your HBase. +Set in _hbase-env.sh_ the LD_LIBRARY_PATH environment variable when starting your HBase. === Compressor Configuration, Installation, and Use @@ -215,17 +215,16 @@ See .Compressor Support On the Master A new configuration setting was introduced in HBase 0.95, to check the Master to determine which data block encoders are installed and configured on it, and assume that the entire cluster is configured the same. -This option, [code]+hbase.master.check.compression+, defaults to [literal]+true+. +This option, `hbase.master.check.compression`, defaults to `true`. This prevents the situation described in link:https://issues.apache.org/jira/browse/HBASE-6370[HBASE-6370], where a table is created or modified to support a codec that a region server does not support, leading to failures that take a long time to occur and are difficult to debug. -If [code]+hbase.master.check.compression+ is enabled, libraries for all desired compressors need to be installed and configured on the Master, even if the Master does not run a region server. +If `hbase.master.check.compression` is enabled, libraries for all desired compressors need to be installed and configured on the Master, even if the Master does not run a region server. .Install GZ Support Via Native Libraries HBase uses Java's built-in GZip support unless the native Hadoop libraries are available on the CLASSPATH. -The recommended way to add libraries to the CLASSPATH is to set the environment variable [var]+HBASE_LIBRARY_PATH+ for the user running HBase. -If native libraries are not available and Java's GZIP is used, [literal]+Got - brand-new compressor+ reports will be present in the logs. +The recommended way to add libraries to the CLASSPATH is to set the environment variable `HBASE_LIBRARY_PATH` for the user running HBase. +If native libraries are not available and Java's GZIP is used, `Got brand-new compressor` reports will be present in the logs. See <<brand.new.compressor,brand.new.compressor>>). [[lzo.compression]] @@ -264,10 +263,10 @@ hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'} HBase does not ship with Snappy support because of licensing issues. You can install Snappy binaries (for instance, by using +yum install snappy+ on CentOS) or build Snappy from source. -After installing Snappy, search for the shared library, which will be called [path]_libsnappy.so.X_ where X is a number. -If you built from source, copy the shared library to a known location on your system, such as [path]_/opt/snappy/lib/_. +After installing Snappy, search for the shared library, which will be called _libsnappy.so.X_ where X is a number. +If you built from source, copy the shared library to a known location on your system, such as _/opt/snappy/lib/_. -In addition to the Snappy library, HBase also needs access to the Hadoop shared library, which will be called something like [path]_libhadoop.so.X.Y_, where X and Y are both numbers. 
+In addition to the Snappy library, HBase also needs access to the Hadoop shared library, which will be called something like _libhadoop.so.X.Y_, where X and Y are both numbers. Make note of the location of the Hadoop library, or copy it to the same location as the Snappy library. [NOTE] @@ -278,7 +277,7 @@ See <<compression.test,compression.test>> to find out how to test that this is t See <<hbase.regionserver.codecs,hbase.regionserver.codecs>> to configure your RegionServers to fail to start if a given compressor is not available. ==== -Each of these library locations need to be added to the environment variable [var]+HBASE_LIBRARY_PATH+ for the operating system user that runs HBase. +Each of these library locations need to be added to the environment variable `HBASE_LIBRARY_PATH` for the operating system user that runs HBase. You need to restart the RegionServer for the changes to take effect. [[compression.test]] @@ -294,14 +293,14 @@ You can use the CompressionTest tool to verify that your compressor is available [[hbase.regionserver.codecs]] .Enforce Compression Settings On a RegionServer -You can configure a RegionServer so that it will fail to restart if compression is configured incorrectly, by adding the option hbase.regionserver.codecs to the [path]_hbase-site.xml_, and setting its value to a comma-separated list of codecs that need to be available. -For example, if you set this property to [literal]+lzo,gz+, the RegionServer would fail to start if both compressors were not available. +You can configure a RegionServer so that it will fail to restart if compression is configured incorrectly, by adding the option hbase.regionserver.codecs to the _hbase-site.xml_, and setting its value to a comma-separated list of codecs that need to be available. +For example, if you set this property to `lzo,gz`, the RegionServer would fail to start if both compressors were not available. This would prevent a new server from being added to the cluster without having codecs configured properly. [[changing.compression]] ==== Enable Compression On a ColumnFamily -To enable compression for a ColumnFamily, use an [code]+alter+ command. +To enable compression for a ColumnFamily, use an `alter` command. You do not need to re-create the table or copy data. If you are changing codecs, be sure the old codec is still available until all the old StoreFiles have been compacted. @@ -342,7 +341,7 @@ DESCRIPTION ENABLED ==== Testing Compression Performance HBase includes a tool called LoadTestTool which provides mechanisms to test your compression performance. -You must specify either [literal]+-write+ or [literal]+-update-read+ as your first parameter, and if you do not specify another parameter, usage advice is printed for each option. +You must specify either `-write` or `-update-read` as your first parameter, and if you do not specify another parameter, usage advice is printed for each option. .+LoadTestTool+ Usage ==== @@ -415,7 +414,7 @@ $ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000 == Enable Data Block Encoding Codecs are built into HBase so no extra configuration is needed. -Codecs are enabled on a table by setting the [code]+DATA_BLOCK_ENCODING+ property. +Codecs are enabled on a table by setting the `DATA_BLOCK_ENCODING` property. Disable the table before altering its DATA_BLOCK_ENCODING setting. 
Following is an example using HBase Shell: http://git-wip-us.apache.org/repos/asf/hbase/blob/5fbf80ee/src/main/asciidoc/_chapters/configuration.adoc ---------------------------------------------------------------------- diff --git a/src/main/asciidoc/_chapters/configuration.adoc b/src/main/asciidoc/_chapters/configuration.adoc index 98cb835..a48281d 100644 --- a/src/main/asciidoc/_chapters/configuration.adoc +++ b/src/main/asciidoc/_chapters/configuration.adoc @@ -30,41 +30,42 @@ This chapter expands upon the <<getting_started,getting started>> chapter to further explain configuration of Apache HBase. Please read this chapter carefully, especially <<basic.prerequisites,basic.prerequisites>> to ensure that your HBase testing and deployment goes smoothly, and prevent data loss. +== Configuration Files Apache HBase uses the same configuration system as Apache Hadoop. -All configuration files are located in the [path]_conf/_ directory, which needs to be kept in sync for each node on your cluster. +All configuration files are located in the _conf/_ directory, which needs to be kept in sync for each node on your cluster. -.HBase Configuration Files -[path]_backup-masters_:: +.HBase Configuration File Descriptions +_backup-masters_:: Not present by default. A plain-text file which lists hosts on which the Master should start a backup Master process, one host per line. -[path]_hadoop-metrics2-hbase.properties_:: +_hadoop-metrics2-hbase.properties_:: Used to connect HBase Hadoop's Metrics2 framework. See the link:http://wiki.apache.org/hadoop/HADOOP-6728-MetricsV2[Hadoop Wiki entry] for more information on Metrics2. Contains only commented-out examples by default. -[path]_hbase-env.cmd_ and [path]_hbase-env.sh_:: +_hbase-env.cmd_ and _hbase-env.sh_:: Script for Windows and Linux / Unix environments to set up the working environment for HBase, including the location of Java, Java options, and other environment variables. The file contains many commented-out examples to provide guidance. -[path]_hbase-policy.xml_:: +_hbase-policy.xml_:: The default policy configuration file used by RPC servers to make authorization decisions on client requests. Only used if HBase security (<<security,security>>) is enabled. -[path]_hbase-site.xml_:: +_hbase-site.xml_:: The main HBase configuration file. This file specifies configuration options which override HBase's default configuration. - You can view (but do not edit) the default configuration file at [path]_docs/hbase-default.xml_. + You can view (but do not edit) the default configuration file at _docs/hbase-default.xml_. You can also view the entire effective configuration for your cluster (defaults and overrides) in the [label]#HBase Configuration# tab of the HBase Web UI. -[path]_log4j.properties_:: - Configuration file for HBase logging via [code]+log4j+. +_log4j.properties_:: + Configuration file for HBase logging via `log4j`. -[path]_regionservers_:: +_regionservers_:: A plain-text file containing a list of hosts which should run a RegionServer in your HBase cluster. - By default this file contains the single entry [literal]+localhost+. - It should contain a list of hostnames or IP addresses, one per line, and should only contain [literal]+localhost+ if each node in your cluster will run a RegionServer on its [literal]+localhost+ interface. + By default this file contains the single entry `localhost`. 
+ It should contain a list of hostnames or IP addresses, one per line, and should only contain `localhost` if each node in your cluster will run a RegionServer on its `localhost` interface. .Checking XML Validity [TIP] @@ -75,11 +76,10 @@ By default, +xmllint+ re-flows and prints the XML to standard output. To check for well-formedness and only print output if errors exist, use the command +xmllint -noout filename.xml+. ==== - .Keep Configuration In Sync Across the Cluster [WARNING] ==== -When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the content of the [path]_conf/_ directory to all nodes of the cluster. +When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the content of the _conf/_ directory to all nodes of the cluster. HBase will not do this for you. Use +rsync+, +scp+, or another secure mechanism for copying the configuration files to your nodes. For most configuration, a restart is needed for servers to pick up changes An exception is dynamic configuration. @@ -123,7 +123,7 @@ support. |N/A |=== -NOTE: In HBase 0.98.5 and newer, you must set `JAVA_HOME` on each node of your cluster. [path]_hbase-env.sh_ provides a handy mechanism to do this. +NOTE: In HBase 0.98.5 and newer, you must set `JAVA_HOME` on each node of your cluster. _hbase-env.sh_ provides a handy mechanism to do this. .Operating System Utilities ssh:: @@ -133,14 +133,14 @@ DNS:: HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolving must work in versions of HBase previous to 0.92.0. The link:https://github.com/sujee/hadoop-dns-checker[hadoop-dns-checker] tool can be used to verify DNS is working correctly on the cluster. The project README file provides detailed instructions on usage. Loopback IP:: - Prior to hbase-0.96.0, HBase only used the IP address [systemitem]+127.0.0.1+ to refer to [code]+localhost+, and this could not be configured. + Prior to hbase-0.96.0, HBase only used the IP address [systemitem]+127.0.0.1+ to refer to `localhost`, and this could not be configured. See <<loopback.ip,loopback.ip>>. NTP:: The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization. See the link:http://www.tldp.org/LDP/sag/html/basic-ntp-config.html[Basic NTP Configuration] at [citetitle]_The Linux Documentation Project (TLDP)_ to set up NTP. Limits on Number of Files and Processes (ulimit):: - Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to [literal]+1024+ (or [literal]+256+ on older versions of OS X). You can check this limit on your servers by running the command +ulimit -n+ when logged in as the user which runs HBase. See <<trouble.rs.runtime.filehandles,trouble.rs.runtime.filehandles>> for some of the problems you may experience if the limit is too low. You may also notice errors such as the following: + Apache HBase is a database. It requires the ability to open a large number of files at once. 
Many Linux distributions limit the number of files a single user is allowed to open to `1024` (or `256` on older versions of OS X). You can check this limit on your servers by running the command +ulimit -n+ when logged in as the user which runs HBase. See <<trouble.rs.runtime.filehandles,trouble.rs.runtime.filehandles>> for some of the problems you may experience if the limit is too low. You may also notice errors such as the following: + ---- 2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException @@ -217,7 +217,7 @@ Use the following legend to interpret this table: .Replace the Hadoop Bundled With HBase! [NOTE] ==== -Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its [path]_lib_ directory. +Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its _lib_ directory. The bundled jar is ONLY for use in standalone mode. In distributed mode, it is _critical_ that the version of Hadoop that is out on your cluster match what is under HBase. Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues. @@ -228,7 +228,7 @@ Hadoop version mismatch issues have various manifestations but often all looks l [[hadoop2.hbase_0.94]] ==== Apache HBase 0.94 with Hadoop 2 -To get 0.94.x to run on hadoop 2.2.0, you need to change the hadoop 2 and protobuf versions in the [path]_pom.xml_: Here is a diff with pom.xml changes: +To get 0.94.x to run on hadoop 2.2.0, you need to change the hadoop 2 and protobuf versions in the _pom.xml_: Here is a diff with pom.xml changes: [source] ---- @@ -298,14 +298,14 @@ Do not move to Apache HBase 0.96.x if you cannot upgrade your Hadoop.. See link: [[hadoop.older.versions]] ==== Hadoop versions 0.20.x - 1.x -HBase will lose data unless it is running on an HDFS that has a durable [code]+sync+ implementation. +HBase will lose data unless it is running on an HDFS that has a durable `sync` implementation. DO NOT use Hadoop 0.20.2, Hadoop 0.20.203.0, and Hadoop 0.20.204.0 which DO NOT have this attribute. Currently only Hadoop versions 0.20.205.x or any release in excess of this version -- this includes hadoop-1.0.0 -- have a working, durable sync. The Cloudera blog post link:http://www.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/[An update on Apache Hadoop 1.0] by Charles Zedlweski has a nice exposition on how all the Hadoop versions relate. Its worth checking out if you are having trouble making sense of the Hadoop version morass. -Sync has to be explicitly enabled by setting [var]+dfs.support.append+ equal to true on both the client side -- in [path]_hbase-site.xml_ -- and on the serverside in [path]_hdfs-site.xml_ (The sync facility HBase needs is a subset of the append code path). +Sync has to be explicitly enabled by setting `dfs.support.append` equal to true on both the client side -- in _hbase-site.xml_ -- and on the serverside in _hdfs-site.xml_ (The sync facility HBase needs is a subset of the append code path). [source,xml] ---- @@ -317,7 +317,7 @@ Sync has to be explicitly enabled by setting [var]+dfs.support.append+ equal to ---- You will have to restart your cluster after making this edit. -Ignore the chicken-little comment you'll find in the [path]_hdfs-default.xml_ in the description for the [var]+dfs.support.append+ configuration. 
+Ignore the chicken-little comment you'll find in the _hdfs-default.xml_ in the description for the `dfs.support.append` configuration. [[hadoop.security]] ==== Apache HBase on Secure Hadoop @@ -325,12 +325,12 @@ Ignore the chicken-little comment you'll find in the [path]_hdfs-default.xml_ in Apache HBase will run on any Hadoop 0.20.x that incorporates Hadoop security features as long as you do as suggested above and replace the Hadoop jar that ships with HBase with the secure version. If you want to read more about how to setup Secure HBase, see <<hbase.secure.configuration,hbase.secure.configuration>>. -[var]+dfs.datanode.max.transfer.threads+ +`dfs.datanode.max.transfer.threads` [[dfs.datanode.max.transfer.threads]] ==== (((dfs.datanode.max.transfer.threads))) An HDFS datanode has an upper bound on the number of files that it will serve at any one time. -Before doing any loading, make sure you have configured Hadoop's [path]_conf/hdfs-site.xml_, setting the [var]+dfs.datanode.max.transfer.threads+ value to at least the following: +Before doing any loading, make sure you have configured Hadoop's _conf/hdfs-site.xml_, setting the `dfs.datanode.max.transfer.threads` value to at least the following: [source,xml] ---- @@ -353,7 +353,7 @@ For example: contain current block. Will get new block locations from namenode and retry... ---- -See also <<casestudies.max.transfer.threads,casestudies.max.transfer.threads>> and note that this property was previously known as [var]+dfs.datanode.max.xcievers+ (e.g. link:http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html[ +See also <<casestudies.max.transfer.threads,casestudies.max.transfer.threads>> and note that this property was previously known as `dfs.datanode.max.xcievers` (e.g. link:http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html[ Hadoop HDFS: Deceived by Xciever]). [[zookeeper.requirements]] @@ -367,10 +367,10 @@ HBase makes use of the [method]+multi+ functionality that is only available sinc HBase has two run modes: <<standalone,standalone>> and <<distributed,distributed>>. Out of the box, HBase runs in standalone mode. -Whatever your mode, you will need to configure HBase by editing files in the HBase [path]_conf_ directory. -At a minimum, you must edit [code]+conf/hbase-env.sh+ to tell HBase which +java+ to use. +Whatever your mode, you will need to configure HBase by editing files in the HBase _conf_ directory. +At a minimum, you must edit `conf/hbase-env.sh` to tell HBase which +java+ to use. In this file you set HBase environment variables such as the heapsize and other options for the +JVM+, the preferred location for log files, etc. -Set [var]+JAVA_HOME+ to point at the root of your +java+ install. +Set `JAVA_HOME` to point at the root of your +java+ install. [[standalone]] === Standalone HBase @@ -417,15 +417,15 @@ Both standalone mode and pseudo-distributed mode are provided for the purposes o For a production environment, distributed mode is appropriate. In distributed mode, multiple instances of HBase daemons run on multiple servers in the cluster. -Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the [code]+hbase-cluster.distributed+ property to [literal]+true+. -Typically, the [code]+hbase.rootdir+ is configured to point to a highly-available HDFS filesystem. +Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the `hbase-cluster.distributed` property to `true`. 
+Typically, the `hbase.rootdir` is configured to point to a highly-available HDFS filesystem. In addition, the cluster is configured so that multiple cluster nodes enlist as RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers. These configuration basics are all demonstrated in <<quickstart_fully_distributed,quickstart-fully-distributed>>. .Distributed RegionServers Typically, your cluster will contain multiple RegionServers all running on different servers, as well as primary and backup Master and Zookeeper daemons. -The [path]_conf/regionservers_ file on the master server contains a list of hosts whose RegionServers are associated with this cluster. +The _conf/regionservers_ file on the master server contains a list of hosts whose RegionServers are associated with this cluster. Each host is on a separate line. All hosts listed in this file will have their RegionServer processes started and stopped when the master server starts or stops. @@ -434,9 +434,9 @@ See section <<zookeeper,zookeeper>> for ZooKeeper setup for HBase. .Example Distributed HBase Cluster ==== -This is a bare-bones [path]_conf/hbase-site.xml_ for a distributed HBase cluster. +This is a bare-bones _conf/hbase-site.xml_ for a distributed HBase cluster. A cluster that is used for real-world work would contain more custom configuration parameters. -Most HBase configuration directives have default values, which are used unless the value is overridden in the [path]_hbase-site.xml_. +Most HBase configuration directives have default values, which are used unless the value is overridden in the _hbase-site.xml_. See <<config.files,config.files>> for more information. [source,xml] @@ -458,8 +458,8 @@ See <<config.files,config.files>> for more information. </configuration> ---- -This is an example [path]_conf/regionservers_ file, which contains a list of each node that should run a RegionServer in the cluster. -These nodes need HBase installed and they need to use the same contents of the [path]_conf/_ directory as the Master server.. +This is an example _conf/regionservers_ file, which contains a list of each node that should run a RegionServer in the cluster. +These nodes need HBase installed and they need to use the same contents of the _conf/_ directory as the Master server.. [source] ---- @@ -469,7 +469,7 @@ node-b.example.com node-c.example.com ---- -This is an example [path]_conf/backup-masters_ file, which contains a list of each node that should run a backup Master instance. +This is an example _conf/backup-masters_ file, which contains a list of each node that should run a backup Master instance. The backup Master instances will sit idle unless the main Master becomes unavailable. [source] @@ -486,19 +486,19 @@ See <<quickstart_fully_distributed,quickstart-fully-distributed>> for a walk-thr .Procedure: HDFS Client Configuration . Of note, if you have made HDFS client configuration on your Hadoop cluster, such as configuration directives for HDFS clients, as opposed to server-side configurations, you must use one of the following methods to enable HBase to see and use these configuration changes: + -a. Add a pointer to your [var]+HADOOP_CONF_DIR+ to the [var]+HBASE_CLASSPATH+ environment variable in [path]_hbase-env.sh_. -b. Add a copy of [path]_hdfs-site.xml_ (or [path]_hadoop-site.xml_) or, better, symlinks, under [path]_${HBASE_HOME}/conf_, or -c. if only a small set of HDFS client configurations, add them to [path]_hbase-site.xml_. +a. 
Add a pointer to your `HADOOP_CONF_DIR` to the `HBASE_CLASSPATH` environment variable in _hbase-env.sh_. +b. Add a copy of _hdfs-site.xml_ (or _hadoop-site.xml_) or, better, symlinks, under _${HBASE_HOME}/conf_, or +c. if only a small set of HDFS client configurations, add them to _hbase-site.xml_. -An example of such an HDFS client configuration is [var]+dfs.replication+. +An example of such an HDFS client configuration is `dfs.replication`. If for example, you want to run with a replication factor of 5, hbase will create files with the default of 3 unless you do the above to make the configuration available to HBase. [[confirm]] == Running and Confirming Your Installation Make sure HDFS is running first. -Start and stop the Hadoop HDFS daemons by running [path]_bin/start-hdfs.sh_ over in the [var]+HADOOP_HOME+ directory. +Start and stop the Hadoop HDFS daemons by running _bin/start-hdfs.sh_ over in the `HADOOP_HOME` directory. You can ensure it started properly by testing the +put+ and +get+ of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons. These do not need to be started. @@ -511,14 +511,14 @@ Start HBase with the following command: bin/start-hbase.sh ---- -Run the above from the [var]+HBASE_HOME+ directory. +Run the above from the `HBASE_HOME` directory. You should now have a running HBase instance. -HBase logs can be found in the [path]_logs_ subdirectory. +HBase logs can be found in the _logs_ subdirectory. Check them out especially if HBase had trouble starting. HBase also puts up a UI listing vital attributes. -By default its deployed on the Master host at port 16010 (HBase RegionServers listen on port 16020 by default and put up an informational http server at 16030). If the Master were running on a host named [var]+master.example.org+ on the default port, to see the Master's homepage you'd point your browser at [path]_http://master.example.org:16010_. +By default its deployed on the Master host at port 16010 (HBase RegionServers listen on port 16020 by default and put up an informational http server at 16030). If the Master were running on a host named `master.example.org` on the default port, to see the Master's homepage you'd point your browser at _http://master.example.org:16010_. Prior to HBase 0.98, the default ports the master ui was deployed on port 16010, and the HBase RegionServers would listen on port 16020 by default and put up an informational http server at 16030. @@ -536,15 +536,15 @@ It can take longer if your cluster is comprised of many machines. If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons. [[config.files]] -== Configuration Files +== Default Configuration [[hbase.site]] -=== [path]_hbase-site.xml_ and [path]_hbase-default.xml_ +=== _hbase-site.xml_ and _hbase-default.xml_ -Just as in Hadoop where you add site-specific HDFS configuration to the [path]_hdfs-site.xml_ file, for HBase, site specific customizations go into the file [path]_conf/hbase-site.xml_. -For the list of configurable properties, see <<hbase_default_configurations,hbase default configurations>> below or view the raw [path]_hbase-default.xml_ source file in the HBase source code at [path]_src/main/resources_. +Just as in Hadoop where you add site-specific HDFS configuration to the _hdfs-site.xml_ file, for HBase, site specific customizations go into the file _conf/hbase-site.xml_. 
+For the list of configurable properties, see <<hbase_default_configurations,hbase default configurations>> below or view the raw _hbase-default.xml_ source file in the HBase source code at _src/main/resources_. -Not all configuration options make it out to [path]_hbase-default.xml_. +Not all configuration options make it out to _hbase-default.xml_. Configuration that it is thought rare anyone would change can exist only in code; the only way to turn up such configurations is via a reading of the source code itself. Currently, changes here will require a cluster restart for HBase to notice the change. @@ -554,19 +554,19 @@ include::../../../../target/asciidoc/hbase-default.adoc[] [[hbase.env.sh]] -=== [path]_hbase-env.sh_ +=== _hbase-env.sh_ Set HBase environment variables in this file. Examples include options to pass the JVM on start of an HBase daemon such as heap size and garbage collector configs. You can also set configurations for HBase configuration, log directories, niceness, ssh options, where to locate process pid files, etc. -Open the file at [path]_conf/hbase-env.sh_ and peruse its content. +Open the file at _conf/hbase-env.sh_ and peruse its content. Each option is fairly well documented. Add your own environment variables here if you want them read by HBase daemons on startup. Changes here will require a cluster restart for HBase to notice the change. [[log4j]] -=== [path]_log4j.properties_ +=== _log4j.properties_ Edit this file to change rate at which HBase files are rolled and to change the level at which HBase logs messages. @@ -580,11 +580,11 @@ If you are running HBase in standalone mode, you don't need to configure anythin Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for current critical locations. ZooKeeper is where all these values are kept. Thus clients require the location of the ZooKeeper ensemble information before they can do anything else. -Usually this the ensemble location is kept out in the [path]_hbase-site.xml_ and is picked up by the client from the [var]+CLASSPATH+. +Usually this the ensemble location is kept out in the _hbase-site.xml_ and is picked up by the client from the `CLASSPATH`. -If you are configuring an IDE to run a HBase client, you should include the [path]_conf/_ directory on your classpath so [path]_hbase-site.xml_ settings can be found (or add [path]_src/test/resources_ to pick up the hbase-site.xml used by tests). +If you are configuring an IDE to run a HBase client, you should include the _conf/_ directory on your classpath so _hbase-site.xml_ settings can be found (or add _src/test/resources_ to pick up the hbase-site.xml used by tests). -Minimally, a client of HBase needs several libraries in its [var]+CLASSPATH+ when connecting to a cluster, including: +Minimally, a client of HBase needs several libraries in its `CLASSPATH` when connecting to a cluster, including: [source] ---- @@ -599,10 +599,9 @@ slf4j-log4j (slf4j-log4j12-1.5.8.jar) zookeeper (zookeeper-3.4.2.jar) ---- -An example basic [path]_hbase-site.xml_ for client only might look as follows: +An example basic _hbase-site.xml_ for client only might look as follows: [source,xml] ---- - <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> @@ -620,7 +619,7 @@ An example basic [path]_hbase-site.xml_ for client only might look as follows: The configuration used by a Java client is kept in an link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] instance. 
-The factory method on HBaseConfiguration, [code]+HBaseConfiguration.create();+, on invocation, will read in the content of the first [path]_hbase-site.xml_ found on the client's [var]+CLASSPATH+, if one is present (Invocation will also factor in any [path]_hbase-default.xml_ found; an hbase-default.xml ships inside the [path]_hbase.X.X.X.jar_). It is also possible to specify configuration directly without having to read from a [path]_hbase-site.xml_. +The factory method on HBaseConfiguration, `HBaseConfiguration.create();`, on invocation, will read in the content of the first _hbase-site.xml_ found on the client's `CLASSPATH`, if one is present (Invocation will also factor in any _hbase-default.xml_ found; an hbase-default.xml ships inside the _hbase.X.X.X.jar_). It is also possible to specify configuration directly without having to read from a _hbase-site.xml_. For example, to set the ZooKeeper ensemble for the cluster programmatically do as follows: [source,java] @@ -629,7 +628,7 @@ Configuration config = HBaseConfiguration.create(); config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally ---- -If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be specified in a comma-separated list (just as in the [path]_hbase-site.xml_ file). This populated [class]+Configuration+ instance can then be passed to an link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable], and so on. +If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be specified in a comma-separated list (just as in the _hbase-site.xml_ file). This populated `Configuration` instance can then be passed to an link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable], and so on. [[example_config]] == Example Configurations @@ -637,20 +636,18 @@ If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be spe === Basic Distributed HBase Install Here is an example basic configuration for a distributed ten node cluster. -The nodes are named [var]+example0+, [var]+example1+, etc., through node [var]+example9+ in this example. -The HBase Master and the HDFS namenode are running on the node [var]+example0+. -RegionServers run on nodes [var]+example1+-[var]+example9+. -A 3-node ZooKeeper ensemble runs on [var]+example1+, [var]+example2+, and [var]+example3+ on the default ports. -ZooKeeper data is persisted to the directory [path]_/export/zookeeper_. -Below we show what the main configuration files -- [path]_hbase-site.xml_, [path]_regionservers_, and [path]_hbase-env.sh_ -- found in the HBase [path]_conf_ directory might look like. +The nodes are named `example0`, `example1`, etc., through node `example9` in this example. +The HBase Master and the HDFS namenode are running on the node `example0`. +RegionServers run on nodes `example1`-`example9`. +A 3-node ZooKeeper ensemble runs on `example1`, `example2`, and `example3` on the default ports. +ZooKeeper data is persisted to the directory _/export/zookeeper_. +Below we show what the main configuration files -- _hbase-site.xml_, _regionservers_, and _hbase-env.sh_ -- found in the HBase _conf_ directory might look like. 
[[hbase_site]] -==== [path]_hbase-site.xml_ +==== _hbase-site.xml_ -[source,bourne] +[source,xml] ---- - - <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> @@ -685,14 +682,13 @@ Below we show what the main configuration files -- [path]_hbase-site.xml_, [path ---- [[regionservers]] -==== [path]_regionservers_ +==== _regionservers_ In this file you list the nodes that will run RegionServers. -In our case, these nodes are [var]+example1+-[var]+example9+. +In our case, these nodes are `example1`-`example9`. [source] ---- - example1 example2 example3 @@ -705,12 +701,12 @@ example9 ---- [[hbase_env]] -==== [path]_hbase-env.sh_ +==== _hbase-env.sh_ -The following lines in the [path]_hbase-env.sh_ file show how to set the [var]+JAVA_HOME+ environment variable (required for HBase 0.98.5 and newer) and set the heap to 4 GB (rather than the default value of 1 GB). If you copy and paste this example, be sure to adjust the [var]+JAVA_HOME+ to suit your environment. +The following lines in the _hbase-env.sh_ file show how to set the `JAVA_HOME` environment variable (required for HBase 0.98.5 and newer) and set the heap to 4 GB (rather than the default value of 1 GB). If you copy and paste this example, be sure to adjust the `JAVA_HOME` to suit your environment. +[source,bash] ---- - # The java implementation to use. export JAVA_HOME=/usr/java/jdk1.7.0/ @@ -718,7 +714,7 @@ export JAVA_HOME=/usr/java/jdk1.7.0/ export HBASE_HEAPSIZE=4096 ---- -Use +rsync+ to copy the content of the [path]_conf_ directory to all nodes of the cluster. +Use +rsync+ to copy the content of the _conf_ directory to all nodes of the cluster. [[important_configurations]] == The Important Configurations @@ -736,7 +732,7 @@ Review the <<os,os>> and <<hadoop,hadoop>> sections. If a cluster with a lot of regions, it is possible if an eager beaver regionserver checks in soon after master start while all the rest in the cluster are laggardly, this first server to checkin will be assigned all regions. If lots of regions, this first server could buckle under the load. -To prevent the above scenario happening up the [var]+hbase.master.wait.on.regionservers.mintostart+ from its default value of 1. +To prevent the above scenario happening up the `hbase.master.wait.on.regionservers.mintostart` from its default value of 1. See link:https://issues.apache.org/jira/browse/HBASE-6389[HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments] for more detail. @@ -756,13 +752,13 @@ See the configuration <<fail.fast.expired.active.master,fail.fast.expired.active ==== ZooKeeper Configuration [[sect.zookeeper.session.timeout]] -===== [var]+zookeeper.session.timeout+ +===== `zookeeper.session.timeout` The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery. You might like to tune the timeout down to a minute or even less so the Master notices failures the sooner. Before changing this value, be sure you have your JVM garbage collection configuration under control otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer (You might be fine with this -- you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time). 
-To change this configuration, edit [path]_hbase-site.xml_, copy the changed file around the cluster and restart. +To change this configuration, edit _hbase-site.xml_, copy the changed file around the cluster and restart. We set this value high to save our having to field noob questions up on the mailing lists asking why a RegionServer went down during a massive import. The usual cause is that their JVM is untuned and they are running into long GC pauses. @@ -781,11 +777,11 @@ See <<zookeeper,zookeeper>>. ===== dfs.datanode.failed.volumes.tolerated This is the "...number of volumes that are allowed to fail before a datanode stops offering service. -By default any volume failure will cause a datanode to shutdown" from the [path]_hdfs-default.xml_ description. +By default any volume failure will cause a datanode to shutdown" from the _hdfs-default.xml_ description. If you have > three or four disks, you might want to set this to 1 or if you have many disks, two or more. [[hbase.regionserver.handler.count_description]] -==== [var]+hbase.regionserver.handler.count+ +==== `hbase.regionserver.handler.count` This setting defines the number of threads that are kept open to answer incoming requests to user tables. The rule of thumb is to keep this number low when the payload per request approaches the MB (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The total size of the queries in progress is limited by the setting "hbase.ipc.server.max.callqueue.size". @@ -826,9 +822,9 @@ However, as all memstores are not expected to be full all the time, less WAL fil [[disable.splitting]] ==== Managed Splitting -HBase generally handles splitting your regions, based upon the settings in your [path]_hbase-default.xml_ and [path]_hbase-site.xml_ configuration files. -Important settings include [var]+hbase.regionserver.region.split.policy+, [var]+hbase.hregion.max.filesize+, [var]+hbase.regionserver.regionSplitLimit+. -A simplistic view of splitting is that when a region grows to [var]+hbase.hregion.max.filesize+, it is split. +HBase generally handles splitting your regions, based upon the settings in your _hbase-default.xml_ and _hbase-site.xml_ configuration files. +Important settings include `hbase.regionserver.region.split.policy`, `hbase.hregion.max.filesize`, `hbase.regionserver.regionSplitLimit`. +A simplistic view of splitting is that when a region grows to `hbase.hregion.max.filesize`, it is split. For most use patterns, most of the time, you should use automatic splitting. See <<manual_region_splitting_decisions,manual region splitting decisions>> for more information about manual region splitting. @@ -839,7 +835,7 @@ Manual splitting can mitigate region creation and movement under load. It also makes it so region boundaries are known and invariant (if you disable region splitting). If you use manual splits, it is easier doing staggered, time-based major compactions spread out your network IO load. .Disable Automatic Splitting -To disable automatic splitting, set [var]+hbase.hregion.max.filesize+ to a very large value, such as [literal]+100 GB+ It is not recommended to set it to its absolute maximum value of [literal]+Long.MAX_VALUE+. +To disable automatic splitting, set `hbase.hregion.max.filesize` to a very large value, such as `100 GB` It is not recommended to set it to its absolute maximum value of `Long.MAX_VALUE`. 
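Where changing _hbase-site.xml_ cluster-wide is not convenient, a rough per-table equivalent of the idea above is to raise the table's `MAX_FILESIZE` attribute (the table-level counterpart of `hbase.hregion.max.filesize`). A sketch using the HBase shell, with `TestTable` as a placeholder table name and 107374182400 bytes standing in for 100 GB:

[source,bash]
----
# Raise the per-table split threshold to ~100 GB so its regions effectively never split
echo "alter 'TestTable', MAX_FILESIZE => '107374182400'" | hbase shell
----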
.Automatic Splitting Is Recommended [NOTE] @@ -858,8 +854,8 @@ The goal is for the largest region to be just large enough that the compaction s Otherwise, the cluster can be prone to compaction storms where a large number of regions under compaction at the same time. It is important to understand that the data growth causes compaction storms, and not the manual split decision. -If the regions are split into too many large regions, you can increase the major compaction interval by configuring [var]+HConstants.MAJOR_COMPACTION_PERIOD+. -HBase 0.90 introduced [class]+org.apache.hadoop.hbase.util.RegionSplitter+, which provides a network-IO-safe rolling split of all regions. +If the regions are split into too many large regions, you can increase the major compaction interval by configuring `HConstants.MAJOR_COMPACTION_PERIOD`. +HBase 0.90 introduced `org.apache.hadoop.hbase.util.RegionSplitter`, which provides a network-IO-safe rolling split of all regions. [[managed.compactions]] ==== Managed Compactions @@ -868,7 +864,7 @@ By default, major compactions are scheduled to run once in a 7-day period. Prior to HBase 0.96.x, major compactions were scheduled to happen once per day by default. If you need to control exactly when and how often major compaction runs, you can disable managed major compactions. -See the entry for [var]+hbase.hregion.majorcompaction+ in the <<compaction.parameters,compaction.parameters>> table for details. +See the entry for `hbase.hregion.majorcompaction` in the <<compaction.parameters,compaction.parameters>> table for details. .Do Not Disable Major Compactions [WARNING] @@ -885,7 +881,7 @@ For more information about compactions and the compaction file selection process ==== Speculative Execution Speculative Execution of MapReduce tasks is on by default, and for HBase clusters it is generally advised to turn off Speculative Execution at a system-level unless you need it for a specific case, where it can be configured per-job. -Set the properties [var]+mapreduce.map.speculative+ and [var]+mapreduce.reduce.speculative+ to false. +Set the properties `mapreduce.map.speculative` and `mapreduce.reduce.speculative` to false. [[other_configuration]] === Other Configurations @@ -894,14 +890,14 @@ Set the properties [var]+mapreduce.map.speculative+ and [var]+mapreduce.reduce.s ==== Balancer The balancer is a periodic operation which is run on the master to redistribute regions on the cluster. -It is configured via [var]+hbase.balancer.period+ and defaults to 300000 (5 minutes). +It is configured via `hbase.balancer.period` and defaults to 300000 (5 minutes). See <<master.processes.loadbalancer,master.processes.loadbalancer>> for more information on the LoadBalancer. [[disabling.blockcache]] ==== Disabling Blockcache -Do not turn off block cache (You'd do it by setting [var]+hbase.block.cache.size+ to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again. +Do not turn off block cache (You'd do it by setting `hbase.block.cache.size` to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again. If your working set it such that block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea on the size you need by surveying regionserver UIs; you'll see index block size accounted near the top of the webpage). 
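For the speculative-execution settings mentioned above, an alternative to a cluster-wide change is to disable speculation per job at submit time, assuming the job's driver goes through ToolRunner/GenericOptionsParser so that `-D` options are honored. The jar, class, and arguments below are placeholders:

[source,bash]
----
# Disable map and reduce speculation for this one job only (jar/class/args are examples)
hadoop jar my-hbase-job.jar com.example.MyHBaseJob \
  -Dmapreduce.map.speculative=false \
  -Dmapreduce.reduce.speculative=false \
  input output
----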
[[nagles]] @@ -925,8 +921,6 @@ HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and late hadoo [source,xml] ---- - -<property> <property> <name>hbase.lease.recovery.dfs.timeout</name> <value>23000</value> @@ -944,7 +938,6 @@ And on the namenode/datanode side, set the following to enable 'staleness' intro [source,xml] ---- - <property> <name>dfs.client.socket-timeout</name> <value>10000</value> @@ -991,7 +984,7 @@ See link:http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent. Historically, besides above port mentioned, JMX opens 2 additional random TCP listening ports, which could lead to port conflict problem.(See link:https://issues.apache.org/jira/browse/HBASE-10289[HBASE-10289] for details) As an alternative, You can use the coprocessor-based JMX implementation provided by HBase. -To enable it in 0.99 or above, add below property in [path]_hbase-site.xml_: +To enable it in 0.99 or above, add below property in _hbase-site.xml_: [source,xml] ---- @@ -1009,7 +1002,6 @@ The reason why you only configure coprocessor for 'regionserver' is that, starti [source,xml] ---- - <property> <name>regionserver.rmi.registry.port</name> <value>61130</value> @@ -1024,7 +1016,8 @@ The registry port can be shared with connector port in most cases, so you only n However if you want to use SSL communication, the 2 ports must be configured to different values. By default the password authentication and SSL communication is disabled. -To enable password authentication, you need to update [path]_hbase-env.sh_ like below: +To enable password authentication, you need to update _hbase-env.sh_ like below: +[source,bash] ---- export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.authenticate=true \ -Dcom.sun.management.jmxremote.password.file=your_password_file \ @@ -1038,6 +1031,7 @@ See example password/access file under $JRE_HOME/lib/management. To enable SSL communication with password authentication, follow below steps: +[source,bash] ---- #1. 
generate a key pair, stored in myKeyStore keytool -genkey -alias jconsole -keystore myKeyStore @@ -1049,10 +1043,10 @@ keytool -export -alias jconsole -keystore myKeyStore -file jconsole.cert keytool -import -alias jconsole -keystore jconsoleKeyStore -file jconsole.cert ---- -And then update [path]_hbase-env.sh_ like below: +And then update _hbase-env.sh_ like below: +[source,bash] ---- - export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=true \ -Djavax.net.ssl.keyStore=/home/tianq/myKeyStore \ -Djavax.net.ssl.keyStorePassword=your_password_in_step_1 \ @@ -1066,11 +1060,12 @@ export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE " Finally start jconsole on client using the key store: +[source,bash] ---- jconsole -J-Djavax.net.ssl.trustStore=/home/tianq/jconsoleKeyStore ---- -NOTE: for HBase 0.98, To enable the HBase JMX implementation on Master, you also need to add below property in [path]_hbase-site.xml_: +NOTE: for HBase 0.98, To enable the HBase JMX implementation on Master, you also need to add below property in _hbase-site.xml_: [source,xml] ---- http://git-wip-us.apache.org/repos/asf/hbase/blob/5fbf80ee/src/main/asciidoc/_chapters/cp.adoc ---------------------------------------------------------------------- diff --git a/src/main/asciidoc/_chapters/cp.adoc b/src/main/asciidoc/_chapters/cp.adoc index 032275a..96f1c2f 100644 --- a/src/main/asciidoc/_chapters/cp.adoc +++ b/src/main/asciidoc/_chapters/cp.adoc @@ -69,7 +69,7 @@ Endpoints (HBase 0.94.x and earlier):: == Examples -An example of an observer is included in [path]_hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestZooKeeperScanPolicyObserver.java_. +An example of an observer is included in _hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestZooKeeperScanPolicyObserver.java_. Several endpoint examples are included in the same directory. == Building A Coprocessor @@ -80,11 +80,11 @@ You can load the coprocessor from your HBase configuration, so that the coproces === Load from Configuration -To configure a coprocessor to be loaded when HBase starts, modify the RegionServer's [path]_hbase-site.xml_ and configure one of the following properties, based on the type of observer you are configuring: +To configure a coprocessor to be loaded when HBase starts, modify the RegionServer's _hbase-site.xml_ and configure one of the following properties, based on the type of observer you are configuring: -* [code]+hbase.coprocessor.region.classes+for RegionObservers and Endpoints -* [code]+hbase.coprocessor.wal.classes+for WALObservers -* [code]+hbase.coprocessor.master.classes+for MasterObservers +* `hbase.coprocessor.region.classes`for RegionObservers and Endpoints +* `hbase.coprocessor.wal.classes`for WALObservers +* `hbase.coprocessor.master.classes`for MasterObservers .Example RegionObserver Configuration ==== @@ -105,7 +105,7 @@ Therefore, the jar file must reside on the server-side HBase classpath. Coprocessors which are loaded in this way will be active on all regions of all tables. These are the system coprocessor introduced earlier. -The first listed coprocessors will be assigned the priority [literal]+Coprocessor.Priority.SYSTEM+. +The first listed coprocessors will be assigned the priority `Coprocessor.Priority.SYSTEM`. Each subsequent coprocessor in the list will have its priority value incremented by one (which reduces its priority, because priorities have the natural sort order of Integers). 
-NOTE: for HBase 0.98, To enable the HBase JMX implementation on Master, you also need to add below property in [path]_hbase-site.xml_:
+NOTE: for HBase 0.98, To enable the HBase JMX implementation on Master, you also need to add below property in _hbase-site.xml_:
[source,xml]
----
http://git-wip-us.apache.org/repos/asf/hbase/blob/5fbf80ee/src/main/asciidoc/_chapters/cp.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/cp.adoc b/src/main/asciidoc/_chapters/cp.adoc
index 032275a..96f1c2f 100644
--- a/src/main/asciidoc/_chapters/cp.adoc
+++ b/src/main/asciidoc/_chapters/cp.adoc
@@ -69,7 +69,7 @@ Endpoints (HBase 0.94.x and earlier)::
== Examples
-An example of an observer is included in [path]_hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestZooKeeperScanPolicyObserver.java_.
+An example of an observer is included in _hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestZooKeeperScanPolicyObserver.java_.
Several endpoint examples are included in the same directory.
== Building A Coprocessor
@@ -80,11 +80,11 @@ You can load the coprocessor from your HBase configuration, so that the coproces
=== Load from Configuration
-To configure a coprocessor to be loaded when HBase starts, modify the RegionServer's [path]_hbase-site.xml_ and configure one of the following properties, based on the type of observer you are configuring:
+To configure a coprocessor to be loaded when HBase starts, modify the RegionServer's _hbase-site.xml_ and configure one of the following properties, based on the type of observer you are configuring:
-* [code]+hbase.coprocessor.region.classes+for RegionObservers and Endpoints
-* [code]+hbase.coprocessor.wal.classes+for WALObservers
-* [code]+hbase.coprocessor.master.classes+for MasterObservers
+* `hbase.coprocessor.region.classes` for RegionObservers and Endpoints
+* `hbase.coprocessor.wal.classes` for WALObservers
+* `hbase.coprocessor.master.classes` for MasterObservers
.Example RegionObserver Configuration
====
@@ -105,7 +105,7 @@ Therefore, the jar file must reside on the server-side HBase classpath.
Coprocessors which are loaded in this way will be active on all regions of all tables.
These are the system coprocessor introduced earlier.
-The first listed coprocessors will be assigned the priority [literal]+Coprocessor.Priority.SYSTEM+.
+The first listed coprocessors will be assigned the priority `Coprocessor.Priority.SYSTEM`.
Each subsequent coprocessor in the list will have its priority value incremented by one (which reduces its priority, because priorities have the natural sort order of Integers).
When calling out to registered observers, the framework executes their callbacks methods in the sorted order of their priority.
@@ -145,7 +145,7 @@ DESCRIPTION ENABLED
====
The coprocessor framework will try to read the class information from the coprocessor table attribute value.
-The value contains four pieces of information which are separated by the [literal]+|+ character.
+The value contains four pieces of information which are separated by the `|` character.
* File path: The jar file containing the coprocessor implementation must be in a location where all region servers can read it.
You could copy the file onto the local disk on each region server, but it is recommended to store it in HDFS.
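To make the loading mechanism above concrete, here is a bare-bones sketch of the kind of RegionObserver that could be listed in `hbase.coprocessor.region.classes`. The package and class names are made up for illustration, and hook signatures differ slightly between HBase versions; the class must be packaged into a jar that is on the RegionServer classpath (or in HDFS when loaded through the table attribute).

[source,java]
----
package org.example.cp;

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

/** Logs every Put that reaches a region this observer is loaded on. */
public class LoggingRegionObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx, Put put,
      WALEdit edit, Durability durability) throws IOException {
    String region = ctx.getEnvironment().getRegion().getRegionInfo().getRegionNameAsString();
    System.out.println("prePut on " + region + " row=" + Bytes.toStringBinary(put.getRow()));
  }
}
----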
http://git-wip-us.apache.org/repos/asf/hbase/blob/5fbf80ee/src/main/asciidoc/_chapters/datamodel.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/datamodel.adoc b/src/main/asciidoc/_chapters/datamodel.adoc
index 8bca69b..854d784 100644
--- a/src/main/asciidoc/_chapters/datamodel.adoc
+++ b/src/main/asciidoc/_chapters/datamodel.adoc
@@ -31,7 +31,8 @@ In HBase, data is stored in tables, which have rows and columns.
This is a terminology overlap with relational databases (RDBMSs), but this is not a helpful analogy.
Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
-.HBase Data Model TerminologyTable::
+.HBase Data Model Terminology
+Table::
An HBase table consists of multiple rows.
Row::
@@ -43,7 +44,7 @@ Row::
If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain.
Column::
- A column in HBase consists of a column family and a column qualifier, which are delimited by a [literal]+:+ (colon) character.
+ A column in HBase consists of a column family and a column qualifier, which are delimited by a `:` (colon) character.
Column Family::
Column families physically colocate a set of columns and their values, often for performance reasons.
@@ -52,7 +53,7 @@ Column Family::
Column Qualifier::
A column qualifier is added to a column family to provide the index for a given piece of data.
- Given a column family [literal]+content+, a column qualifier might be [literal]+content:html+, and another might be [literal]+content:pdf+.
+ Given a column family `content`, a column qualifier might be `content:html`, and another might be `content:pdf`.
Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.
Cell::
@@ -65,48 +66,39 @@ Timestamp::
[[conceptual.view]]
== Conceptual View
-You can read a very understandable explanation of the HBase data model in the blog post link:http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable[Understanding
- HBase and BigTable] by Jim R.
-Wilson.
-Another good explanation is available in the PDF link:http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf[Introduction
- to Basic Schema Design] by Amandeep Khurana.
+You can read a very understandable explanation of the HBase data model in the blog post link:http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable[Understanding HBase and BigTable] by Jim R. Wilson.
+
+Another good explanation is available in the PDF link:http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf[Introduction
+to Basic Schema Design] by Amandeep Khurana.
+
It may help to read different perspectives to get a solid understanding of HBase schema design.
The linked articles cover the same ground as the information in this section.
The following example is a slightly modified form of the one on page 2 of the link:http://research.google.com/archive/bigtable.html[BigTable] paper.
-There is a table called [var]+webtable+ that contains two rows ([literal]+com.cnn.www+ and [literal]+com.example.www+), three column families named [var]+contents+, [var]+anchor+, and [var]+people+.
-In this example, for the first row ([literal]+com.cnn.www+), [var]+anchor+ contains two columns ([var]+anchor:cssnsi.com+, [var]+anchor:my.look.ca+) and [var]+contents+ contains one column ([var]+contents:html+). This example contains 5 versions of the row with the row key [literal]+com.cnn.www+, and one version of the row with the row key [literal]+com.example.www+.
-The [var]+contents:html+ column qualifier contains the entire HTML of a given website.
-Qualifiers of the [var]+anchor+ column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link.
-The [var]+people+ column family represents people associated with the site.
+There is a table called `webtable` that contains two rows (`com.cnn.www` and `com.example.www`), three column families named `contents`, `anchor`, and `people`.
+In this example, for the first row (`com.cnn.www`), `anchor` contains two columns (`anchor:cssnsi.com`, `anchor:my.look.ca`) and `contents` contains one column (`contents:html`). This example contains 5 versions of the row with the row key `com.cnn.www`, and one version of the row with the row key `com.example.www`.
+The `contents:html` column qualifier contains the entire HTML of a given website.
+Qualifiers of the `anchor` column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link.
+The `people` column family represents people associated with the site.
.Column Names
[NOTE]
====
By convention, a column name is made of its column family prefix and a _qualifier_.
-For example, the column _contents:html_ is made up of the column family [var]+contents+ and the [var]+html+ qualifier.
-The colon character ([literal]+:+) delimits the column family from the column family _qualifier_.
+For example, the column _contents:html_ is made up of the column family `contents` and the `html` qualifier.
+The colon character (`:`) delimits the column family from the column family _qualifier_.
====
-.Table [var]+webtable+
[cols="1,1,1,1,1", frame="all", options="header"]
|===
-| Row Key
-| Time Stamp
-| ColumnFamily contents
-| ColumnFamily anchor
-| ColumnFamily people
-| anchor:cnnsi.com = "CNN"
-
-| anchor:my.look.ca = "CNN.com"
-
-| contents:html = "<html>..."
-
-| contents:html = "<html>..."
-
-| contents:html = "<html>..."
-
-| contents:html = "<html>..."
+|Row Key |Time Stamp |ColumnFamily `contents` |ColumnFamily `anchor`|ColumnFamily `people`
+|"com.cnn.www" |t9 | |anchor:cnnsi.com = "CNN" |
+|"com.cnn.www" |t8 | |anchor:my.look.ca = "CNN.com" |
+|"com.cnn.www" |t6 | contents:html = "<html>..." | |
+|"com.cnn.www" |t5 | contents:html = "<html>..." | |
+|"com.cnn.www" |t3 | contents:html = "<html>..." | |
+|"com.example.www"| t5 | contents:html = "<html>..." | people:author = "John Doe"
|===
Cells in this table that appear to be empty do not take space, or in fact exist, in HBase.
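As a rough illustration of how cells like these are addressed from client code, the following sketch writes two of the `anchor` cells shown above with explicit timestamps. The timestamp values and running cluster are assumptions, and it uses the HBase 1.0+ client API; older clients would use `HTable` and `Put.add()` instead.

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Connection conn = ConnectionFactory.createConnection(conf);
    Table table = conn.getTable(TableName.valueOf("webtable"));
    try {
      long t8 = 1431000008000L, t9 = 1431000009000L; // stand-ins for the t8/t9 versions
      Put put = new Put(Bytes.toBytes("com.cnn.www"));
      // column family, qualifier, explicit version (timestamp), value
      put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"), t9, Bytes.toBytes("CNN"));
      put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("my.look.ca"), t8, Bytes.toBytes("CNN.com"));
      table.put(put);
    } finally {
      table.close();
      conn.close();
    }
  }
}
----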
@@ -114,9 +106,8 @@ This is what makes HBase "sparse." A tabular view is not the only possible way t
The following represents the same information as a multi-dimensional map.
This is only a mock-up for illustrative purposes and may not be strictly accurate.
-[source]
+[source,json]
----
-
{
"com.cnn.www": {
contents: {
@@ -148,36 +139,31 @@ Although at a conceptual level tables may be viewed as a sparse set of rows, they are physically stored by column family.
A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.
-.ColumnFamily [var]+anchor+
+.ColumnFamily `anchor`
[cols="1,1,1", frame="all", options="header"]
|===
-| Row Key
-| Time Stamp
-| Column Family anchor
-| anchor:cnnsi.com = "CNN"
-
-| anchor:my.look.ca = "CNN.com"
+|Row Key | Time Stamp |Column Family `anchor`
+|"com.cnn.www" |t9 |`anchor:cnnsi.com = "CNN"`
+|"com.cnn.www" |t8 |`anchor:my.look.ca = "CNN.com"`
|===
-.ColumnFamily [var]+contents+
+
+.ColumnFamily `contents`
[cols="1,1,1", frame="all", options="header"]
|===
-| Row Key
-| Time Stamp
-| ColumnFamily "contents:"
-| contents:html = "<html>..."
-
-| contents:html = "<html>..."
-
-| contents:html = "<html>..."
+|Row Key |Time Stamp |ColumnFamily `contents:`
+|"com.cnn.www" |t6 |contents:html = "<html>..."
+|"com.cnn.www" |t5 |contents:html = "<html>..."
+|"com.cnn.www" |t3 |contents:html = "<html>..."
|===
+
The empty cells shown in the conceptual view are not stored at all.
-Thus a request for the value of the [var]+contents:html+ column at time stamp [literal]+t8+ would return no value.
-Similarly, a request for an [var]+anchor:my.look.ca+ value at time stamp [literal]+t9+ would return no value.
+Thus a request for the value of the `contents:html` column at time stamp `t8` would return no value.
+Similarly, a request for an `anchor:my.look.ca` value at time stamp `t9` would return no value.
However, if no timestamp is supplied, the most recent value for a particular column would be returned.
Given multiple versions, the most recent is also the first one found, since timestamps are stored in descending order.
-Thus a request for the values of all columns in the row [var]+com.cnn.www+ if no timestamp is specified would be: the value of [var]+contents:html+ from timestamp [literal]+t6+, the value of [var]+anchor:cnnsi.com+ from timestamp [literal]+t9+, the value of [var]+anchor:my.look.ca+ from timestamp [literal]+t8+.
+Thus a request for the values of all columns in the row `com.cnn.www` if no timestamp is specified would be: the value of `contents:html` from timestamp `t6`, the value of `anchor:cnnsi.com` from timestamp `t9`, the value of `anchor:my.look.ca` from timestamp `t8`.
For more information about the internals of how Apache HBase stores data, see <<regions.arch,regions.arch>>.
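The timestamp behaviour described above can also be observed from the client side. A hedged sketch against the same assumed `webtable` (HBase 1.0+ API): a `Get` pinned to `t8` finds no `contents:html` cell, while a `Get` with no timestamp returns the newest version, here the one written at `t6`.

[source,java]
----
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampLookup {
  public static void main(String[] args) throws Exception {
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("webtable"));
    try {
      byte[] row = Bytes.toBytes("com.cnn.www");
      byte[] cf = Bytes.toBytes("contents");
      byte[] q = Bytes.toBytes("html");
      long t8 = 1431000008000L; // stand-in for the t8 version used above

      Get pinned = new Get(row).addColumn(cf, q);
      pinned.setTimeStamp(t8); // exactly version t8: no contents:html cell exists there
      Result atT8 = table.get(pinned);
      System.out.println("empty at t8? " + atT8.isEmpty());

      Get latest = new Get(row).addColumn(cf, q); // no timestamp: newest version (t6 here)
      Result newest = table.get(latest);
      System.out.println("latest value: " + Bytes.toString(newest.getValue(cf, q)));
    } finally {
      table.close();
      conn.close();
    }
  }
}
----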
@@ -269,7 +255,7 @@ The empty byte array is used to denote both the start and end of a tables' names
Columns in Apache HBase are grouped into _column families_.
All column members of a column family have the same prefix.
For example, the columns _courses:history_ and _courses:math_ are both members of the _courses_ column family.
-The colon character ([literal]+:+) delimits the column family from the
+The colon character (`:`) delimits the column family from the
column family qualifier.
The column family prefix must be composed of _printable_ characters.
The qualifying tail, the column family _qualifier_, can be made of any arbitrary bytes.
@@ -280,7 +266,7 @@ Because tunings and storage specifications are done at the column family level,
== Cells
-A _{row, column, version}_ tuple exactly specifies a [literal]+cell+ in HBase.
+A _{row, column, version}_ tuple exactly specifies a `cell` in HBase.
Cell content is uninterrpreted bytes
== Data Model Operations
@@ -344,15 +330,15 @@ See <<version.delete,version.delete>> for more information on deleting versions
[[versions]]
== Versions
-A _{row, column, version}_ tuple exactly specifies a [literal]+cell+ in HBase.
+A _{row, column, version}_ tuple exactly specifies a `cell` in HBase.
It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.
While rows and column keys are expressed as bytes, the version is specified using a long integer.
-Typically this long contains time instances such as those returned by [code]+java.util.Date.getTime()+ or [code]+System.currentTimeMillis()+, that is: [quote]_the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC_.
+Typically this long contains time instances such as those returned by `java.util.Date.getTime()` or `System.currentTimeMillis()`, that is: [quote]_the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC_.
The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.
-There is a lot of confusion over the semantics of [literal]+cell+ versions, in HBase.
+There is a lot of confusion over the semantics of `cell` versions, in HBase.
In particular:
* If multiple writes to a cell have the same version, only the last written is fetchable.
@@ -367,12 +353,12 @@ This section is basically a synopsis of this article by Bruno Dumon.
[[specify.number.of.versions]]
=== Specifying the Number of Versions to Store
-The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an +alter+ command, via [code]+HColumnDescriptor.DEFAULT_VERSIONS+.
-Prior to HBase 0.96, the default number of versions kept was [literal]+3+, but in 0.96 and newer has been changed to [literal]+1+.
+The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an +alter+ command, via `HColumnDescriptor.DEFAULT_VERSIONS`.
+Prior to HBase 0.96, the default number of versions kept was `3`, but in 0.96 and newer has been changed to `1`.
.Modify the Maximum Number of Versions for a Column
====
-This example uses HBase Shell to keep a maximum of 5 versions of column [code]+f1+.
+This example uses HBase Shell to keep a maximum of 5 versions of column `f1`.
You could also use link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
----
@@ -384,7 +370,7 @@ hbase> alter ‘t1′, NAME => ‘f1′, VERSIONS => 5
====
You can also specify the minimum number of versions to store.
By default, this is set to 0, which means the feature is disabled.
-The following example sets the minimum number of versions on field [code]+f1+ to [literal]+2+, via HBase Shell.
+The following example sets the minimum number of versions on field `f1` to `2`, via HBase Shell.
You could also use link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
----
@@ -392,7 +378,7 @@ hbase> alter ‘t1′, NAME => ‘f1′, MIN_VERSIONS => 2
----
====
-Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting +hbase.column.max.version+ in [path]_hbase-site.xml_.
+Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting +hbase.column.max.version+ in _hbase-site.xml_.
See <<hbase.column.max.version,hbase.column.max.version>>.
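As the text notes, the same schema changes can be made with `HColumnDescriptor` from the Java client rather than the shell. A rough sketch under the 1.0-era Admin API (table and family names mirror the shell examples; the method names have shifted in later releases):

[source,java]
----
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class AlterVersions {
  public static void main(String[] args) throws Exception {
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Admin admin = conn.getAdmin();
    try {
      // Fetch the current descriptor so other family settings are preserved.
      HTableDescriptor htd = admin.getTableDescriptor(TableName.valueOf("t1"));
      HColumnDescriptor f1 = htd.getFamily(Bytes.toBytes("f1"));
      f1.setMaxVersions(5); // VERSIONS => 5
      f1.setMinVersions(2); // MIN_VERSIONS => 2 (only meaningful together with a TTL)
      admin.modifyColumn(TableName.valueOf("t1"), f1);
    } finally {
      admin.close();
      conn.close();
    }
  }
}
----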
[[versions.ops]]
@@ -406,7 +392,7 @@ Gets are implemented on top of Scans.
The below discussion of link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] applies equally to link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scans].
By default, i.e.
-if you specify no explicit version, when doing a [literal]+get+, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
+if you specify no explicit version, when doing a `get`, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
* to return more than one version, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()[Get.setMaxVersions()]
* to return versions other than the latest, see link:???[Get.setTimeRange()]
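A hedged illustration of the two bullets above, again against the assumed `webtable` and the HBase 1.0-era API (`setMaxVersions` and `setTimeRange` have been renamed in later client versions):

[source,java]
----
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedGets {
  public static void main(String[] args) throws Exception {
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("webtable"));
    try {
      byte[] row = Bytes.toBytes("com.cnn.www");
      byte[] cf = Bytes.toBytes("contents");
      byte[] q = Bytes.toBytes("html");

      Get get = new Get(row).addColumn(cf, q);
      get.setMaxVersions(3);                // return up to three versions, newest first
      get.setTimeRange(0L, Long.MAX_VALUE); // narrow this range to skip the latest version
      Result result = table.get(get);
      for (Cell cell : result.rawCells()) {
        System.out.println(cell.getTimestamp() + " -> "
            + Bytes.toString(CellUtil.cloneValue(cell)));
      }
    } finally {
      table.close();
      conn.close();
    }
  }
}
----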
@@ -448,8 +434,8 @@ List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this colu
==== Put
-Doing a put always creates a new version of a [literal]+cell+, at a certain timestamp.
-By default the system uses the server's [literal]+currentTimeMillis+, but you can specify the version (= the long integer) yourself, on a per-column level.
+Doing a put always creates a new version of a `cell`, at a certain timestamp.
+By default the system uses the server's `currentTimeMillis`, but you can specify the version (= the long integer) yourself, on a per-column level.
This means you could assign a time in the past or the future, or use the long value for non-time purposes.
To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would overshadow.
@@ -504,7 +490,7 @@ When deleting an entire row, HBase will internally create a tombstone for each C
Deletes work by creating _tombstone_ markers.
For example, let's suppose we want to delete a row.
-For this you can specify a version, or else by default the [literal]+currentTimeMillis+ is used.
+For this you can specify a version, or else by default the `currentTimeMillis` is used.
What this means is [quote]_delete all cells where the version is less than or equal to this version_.
HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition.
@@ -518,7 +504,7 @@ For an informative discussion on how deletes and versioning interact, see the th
Also see <<keyvalue,keyvalue>> for more information on the internal KeyValue format.
Delete markers are purged during the next major compaction of the store, unless the +KEEP_DELETED_CELLS+ option is set in the column family.
-To keep the deletes for a configurable amount of time, you can set the delete TTL via the +hbase.hstore.time.to.purge.deletes+ property in [path]_hbase-site.xml_.
+To keep the deletes for a configurable amount of time, you can set the delete TTL via the +hbase.hstore.time.to.purge.deletes+ property in _hbase-site.xml_.
If +hbase.hstore.time.to.purge.deletes+ is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction.
Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker's timestamp plus the value of +hbase.hstore.time.to.purge.deletes+, in milliseconds.
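To tie the delete discussion together, a small illustrative sketch (same assumed table and family names as earlier): deleting a column at an explicit version writes a tombstone that masks cells at or below that timestamp until the next major compaction, subject to `KEEP_DELETED_CELLS` and `hbase.hstore.time.to.purge.deletes` as described above.

[source,java]
----
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneExample {
  public static void main(String[] args) throws Exception {
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("webtable"));
    try {
      long version = System.currentTimeMillis(); // or an explicit version in the past
      Delete delete = new Delete(Bytes.toBytes("com.cnn.www"));
      // Tombstones all versions of contents:html whose timestamp is <= the given version.
      delete.addColumns(Bytes.toBytes("contents"), Bytes.toBytes("html"), version);
      table.delete(delete);
    } finally {
      table.close();
      conn.close();
    }
  }
}
----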
