http://git-wip-us.apache.org/repos/asf/hbase-site/blob/9fb0764b/book.html
----------------------------------------------------------------------
diff --git a/book.html b/book.html
index 9eeedb5..3cf6127 100644
--- a/book.html
+++ b/book.html
@@ -1381,11 +1381,10 @@ To check for well-formedness and only print output if errors exist, use the comm
 <td class="content">
 <div class="title">Keep Configuration In Sync Across the Cluster</div>
 <div class="paragraph">
-<p>When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the content of the <em>conf/</em> directory to all nodes of the cluster.
+<p>When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the contents of the <em>conf/</em> directory to all nodes of the cluster.
 HBase will not do this for you.
 Use <code>rsync</code>, <code>scp</code>, or another secure mechanism for copying the configuration files to your nodes.
-For most configuration, a restart is needed for servers to pick up changes An exception is dynamic configuration.
-to be described later below.</p>
+For most configurations, a restart is needed for servers to pick up changes. Dynamic configuration, described below, is an exception.</p>
 </div>
 </td>
 </tr>
@@ -1473,12 +1472,12 @@ You must set <code>JAVA_HOME</code> on each node of your cluster. <em>hbase-env.
 </dd>
 <dt class="hdlist1">Loopback IP</dt>
 <dd>
-<p>Prior to hbase-0.96.0, HBase only used the IP address <code>127.0.0.1</code> to refer to <code>localhost</code>, and this could not be configured.
+<p>Prior to hbase-0.96.0, HBase only used the IP address <code>127.0.0.1</code> to refer to <code>localhost</code>, and this was not configurable.
 See <a href="#loopback.ip">Loopback IP</a> for more details.</p>
 </dd>
 <dt class="hdlist1">NTP</dt>
 <dd>
-<p>The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization. See the <a href="http://www.tldp.org/LDP/sag/html/basic-ntp-config.html">Basic NTP Configuration</a> at <em class="citetitle">The Linux Documentation Project (TLDP)</em> to set up NTP.</p>
+<p>The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism on your cluster, and that all nodes look to the same service for time synchronization. See the <a href="http://www.tldp.org/LDP/sag/html/basic-ntp-config.html">Basic NTP Configuration</a> at <em class="citetitle">The Linux Documentation Project (TLDP)</em> to set up NTP.</p>
 </dd>
 </dl>
 </div>
@@ -1540,8 +1539,8 @@ hadoop - nproc 32000</pre>
 </dd>
 <dt class="hdlist1">Windows</dt>
 <dd>
-<p>Prior to HBase 0.96, testing for running HBase on Microsoft Windows was limited.
-Running a on Windows nodes is not recommended for production systems.</p>
+<p>Prior to HBase 0.96, running HBase on Microsoft Windows was limited to testing purposes only.
+Running production systems on Windows machines is not recommended.</p>
 </dd>
 </dl>
 </div>
@@ -1774,8 +1773,8 @@ data loss. This patch is present in Apache Hadoop releases 2.6.1+.</p>
 The bundled jar is ONLY for use in standalone mode.
 In distributed mode, it is <em>critical</em> that the version of Hadoop that is out on your cluster match what is under HBase.
 Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues.
-Make sure you replace the jar in HBase everywhere on your cluster.
-Hadoop version mismatch issues have various manifestations but often all looks like its hung up.</p>
+Make sure you replace the jar in HBase across your whole cluster.
+Hadoop version mismatch issues have various manifestations, but often they all simply look like a hung cluster.</p>
 </div>
 </td>
 </tr>
@@ -1860,7 +1859,7 @@ HDFS where data is replicated ensures the latter.</p>
 </div>
 <div class="paragraph">
 <p>To configure this standalone variant, edit your <em>hbase-site.xml</em>
-setting the <em>hbase.rootdir</em> to point at a directory in your
+setting <em>hbase.rootdir</em> to point at a directory in your
 HDFS instance but then set <em>hbase.cluster.distributed</em> to
 <em>false</em>. For example:</p>
 </div>
@@ -1912,8 +1911,8 @@ Some of the information that was originally in this section has been moved there
 </div>
 <div class="paragraph">
 <p>A pseudo-distributed mode is simply a fully-distributed mode run on a single host.
-Use this configuration testing and prototyping on HBase.
-Do not use this configuration for production nor for evaluating HBase performance.</p>
+Use this HBase configuration for testing and prototyping purposes only.
+Do not use this configuration for production or for performance evaluation.</p>
 </div>
 </div>
 </div>
@@ -1922,11 +1921,11 @@ Do not use this configuration for production nor for evaluating HBase performanc
 <div class="paragraph">
 <p>By default, HBase runs in standalone mode.
 Both standalone mode and pseudo-distributed mode are provided for the purposes of small-scale testing.
-For a production environment, distributed mode is appropriate.
+For a production environment, distributed mode is advised.
 In distributed mode, multiple instances of HBase daemons run on multiple servers in the cluster.</p>
 </div>
 <div class="paragraph">
-<p>Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the <code>hbase-cluster.distributed</code> property to <code>true</code>.
+<p>Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the <code>hbase.cluster.distributed</code> property to <code>true</code>.
 Typically, the <code>hbase.rootdir</code> is configured to point to a highly-available HDFS filesystem.</p>
 </div>
 <div class="paragraph">
@@ -2088,7 +2087,7 @@ For the list of configurable properties, see <a href="#hbase_default_configurati
 </div>
 <div class="paragraph">
 <p>Not all configuration options make it out to <em>hbase-default.xml</em>.
-Configuration that it is thought rare anyone would change can exist only in code; the only way to turn up such configurations is via a reading of the source code itself.</p>
+Some configurations appear only in source code; the only way to identify them is through code review.</p>
 </div>
 <div class="paragraph">
 <p>Currently, changes here will require a cluster restart for HBase to notice the change.</p>
 </div>
@@ -5113,13 +5112,13 @@ Add your own environment variables here if you want them read by HBase daemons o
 <p>Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for current critical locations.
 ZooKeeper is where all these values are kept.
 Thus clients require the location of the ZooKeeper ensemble before they can do anything else.
-Usually this the ensemble location is kept out in the <em>hbase-site.xml</em> and is picked up by the client from the <code>CLASSPATH</code>.</p>
+Usually this ensemble location is kept out in the <em>hbase-site.xml</em> and is picked up by the client from the <code>CLASSPATH</code>.</p>
 </div>
 <div class="paragraph">
 <p>If you are configuring an IDE to run an HBase client, you should include the <em>conf/</em> directory on your classpath so <em>hbase-site.xml</em> settings can be found (or add <em>src/test/resources</em> to pick up the hbase-site.xml used by tests).</p>
 </div>
 <div class="paragraph">
-<p>Minimally, a client of HBase needs several libraries in its <code>CLASSPATH</code> when connecting to a cluster, including:</p>
+<p>Minimally, an HBase client needs several libraries in its <code>CLASSPATH</code> when connecting to a cluster, including:</p>
 </div>
 <div class="listingblock">
 <div class="content">
@@ -5135,7 +5134,7 @@ zookeeper (zookeeper-<span class="float">3.4</span><span class="float">.2</span>
 </div>
 </div>
 <div class="paragraph">
-<p>An example basic <em>hbase-site.xml</em> for client only might look as follows:</p>
+<p>A basic example <em>hbase-site.xml</em> for a client-only setup may look as follows:</p>
 </div>
 <div class="listingblock">
 <div class="content">
@@ -5179,7 +5178,7 @@ config.set(<span class="string"><span class="delimiter">"</span><span class
 <div class="sect2">
 <h3 id="_basic_distributed_hbase_install"><a class="anchor" href="#_basic_distributed_hbase_install"></a>8.1. Basic Distributed HBase Install</h3>
 <div class="paragraph">
-<p>Here is an example basic configuration for a distributed ten node cluster:
+<p>Here is a basic configuration example for a distributed ten-node cluster:
 * The nodes are named <code>example0</code>, <code>example1</code>, etc., through node <code>example9</code> in this example.
 * The HBase Master and the HDFS NameNode are running on the node <code>example0</code>.
 * RegionServers run on nodes <code>example1</code>-<code>example9</code>.
@@ -5299,11 +5298,11 @@ See <a href="https://issues.apache.org/jira/browse/HBASE-6389">HBASE-6389 Modify
 <h5 id="sect.zookeeper.session.timeout"><a class="anchor" href="#sect.zookeeper.session.timeout"></a><code>zookeeper.session.timeout</code></h5>
 <div class="paragraph">
 <p>The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery.
-You might like to tune the timeout down to a minute or even less so the Master notices failures the sooner.
-Before changing this value, be sure you have your JVM garbage collection configuration under control otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer (You might be fine with this — you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time).</p>
+You might need to tune the timeout down to a minute or even less so the Master notices failures sooner.
+Before changing this value, be sure you have your JVM garbage collection configuration under control; otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer. (You might be fine with this — you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time.)</p>
 </div>
 <div class="paragraph">
-<p>To change this configuration, edit <em>hbase-site.xml</em>, copy the changed file around the cluster and restart.</p>
+<p>To change this configuration, edit <em>hbase-site.xml</em>, copy the changed file across the cluster, and restart.</p>
 </div>
 <div class="paragraph">
 <p>We set this value high to save our having to field questions up on the mailing lists asking why a RegionServer went down during a massive import.
@@ -5322,16 +5321,15 @@ Later when they’ve built some confidence, then they can play with configur
 <div class="sect3">
 <h4 id="recommended.configurations.hdfs"><a class="anchor" href="#recommended.configurations.hdfs"></a>9.2.2. HDFS Configurations</h4>
 <div class="sect4">
-<h5 id="dfs.datanode.failed.volumes.tolerated"><a class="anchor" href="#dfs.datanode.failed.volumes.tolerated"></a>dfs.datanode.failed.volumes.tolerated</h5>
+<h5 id="dfs.datanode.failed.volumes.tolerated"><a class="anchor" href="#dfs.datanode.failed.volumes.tolerated"></a><code>dfs.datanode.failed.volumes.tolerated</code></h5>
 <div class="paragraph">
 <p>This is the "…​number of volumes that are allowed to fail before a DataNode stops offering service. By default any volume failure will cause a datanode to shutdown" from the <em>hdfs-default.xml</em> description.
 You might want to set this to about half the amount of your available disks.</p>
 </div>
 </div>
-</div>
-<div class="sect3">
-<h4 id="hbase.regionserver.handler.count_description"><a class="anchor" href="#hbase.regionserver.handler.count_description"></a>9.2.3. <code>hbase.regionserver.handler.count</code></h4>
+<div class="sect4">
+<h5 id="hbase.regionserver.handler.count"><a class="anchor" href="#hbase.regionserver.handler.count"></a><code>hbase.regionserver.handler.count</code></h5>
 <div class="paragraph">
 <p>This setting defines the number of threads that are kept open to answer incoming requests to user tables.
 The rule of thumb is to keep this number low when the payload per request approaches the MB (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes).
 The total size of the queries in progress is limited by the setting <code>hbase.ipc.server.max.callqueue.size</code>.</p>
 </div>
@@ -5347,16 +5345,17 @@ A RegionServer running on low memory will trigger its JVM’s garbage collec
 <p>You can get a sense of whether you have too little or too many handlers by <a href="#rpc.logging">rpc.logging</a> on an individual RegionServer then tailing its logs (Queued requests consume memory).</p>
 </div>
 </div>
+</div>
 <div class="sect3">
-<h4 id="big_memory"><a class="anchor" href="#big_memory"></a>9.2.4. Configuration for large memory machines</h4>
+<h4 id="big_memory"><a class="anchor" href="#big_memory"></a>9.2.3. Configuration for large memory machines</h4>
 <div class="paragraph">
 <p>HBase ships with a reasonable, conservative configuration that will work on nearly all machine types that people might want to test with.
-If you have larger machines — HBase has 8G and larger heap — you might the following configuration options helpful.
+If you have larger machines — where HBase has an 8G or larger heap — you might find the following configuration options helpful.
 TODO.</p>
 </div>
 </div>
 <div class="sect3">
-<h4 id="config.compression"><a class="anchor" href="#config.compression"></a>9.2.5. Compression</h4>
+<h4 id="config.compression"><a class="anchor" href="#config.compression"></a>9.2.4. Compression</h4>
 <div class="paragraph">
 <p>You should consider enabling ColumnFamily compression.
 There are several options that are near-frictionless and in most all cases boost performance by reducing the size of StoreFiles and thus reducing I/O.</p>
 </div>
 <div class="paragraph">
@@ -5366,7 +5365,7 @@ There are several options that are near-frictionless and in most all cases boost
 </div>
 </div>
 <div class="sect3">
-<h4 id="config.wals"><a class="anchor" href="#config.wals"></a>9.2.6. Configuring the size and number of WAL files</h4>
+<h4 id="config.wals"><a class="anchor" href="#config.wals"></a>9.2.5. Configuring the size and number of WAL files</h4>
 <div class="paragraph">
 <p>HBase uses <a href="#wal">wal</a> to recover the memstore data that has not been flushed to disk in case of an RS failure.
 These WAL files should be configured to be slightly smaller than HDFS block (by default a HDFS block is 64Mb and a WAL file is ~60Mb).</p>
 </div>
 <div class="paragraph">
@@ -5379,12 +5378,12 @@ However, as all memstores are not expected to be full all the time, less WAL fil
 </div>
 </div>
 <div class="sect3">
-<h4 id="disable.splitting"><a class="anchor" href="#disable.splitting"></a>9.2.7. Managed Splitting</h4>
+<h4 id="disable.splitting"><a class="anchor" href="#disable.splitting"></a>9.2.6. Managed Splitting</h4>
 <div class="paragraph">
-<p>HBase generally handles splitting your regions, based upon the settings in your <em>hbase-default.xml</em> and <em>hbase-site.xml</em> configuration files.
+<p>HBase generally handles splitting of your regions based upon the settings in your <em>hbase-default.xml</em> and <em>hbase-site.xml</em> configuration files.
 Important settings include <code>hbase.regionserver.region.split.policy</code>, <code>hbase.hregion.max.filesize</code>, <code>hbase.regionserver.regionSplitLimit</code>.
 A simplistic view of splitting is that when a region grows to <code>hbase.hregion.max.filesize</code>, it is split.
-For most use patterns, most of the time, you should use automatic splitting.
+For most usage patterns, you should use automatic splitting.
 See <a href="#manual_region_splitting_decisions">manual region splitting decisions</a> for more information about manual region splitting.</p>
 </div>
 <div class="paragraph">
@@ -5422,8 +5421,8 @@ It is better to err on the side of too few regions and perform rolling splits la
 The optimal number of regions depends upon the largest StoreFile in your region.
 The size of the largest StoreFile will increase with time if the amount of data grows.
 The goal is for the largest region to be just large enough that the compaction selection algorithm only compacts it during a timed major compaction.
-Otherwise, the cluster can be prone to compaction storms where a large number of regions under compaction at the same time.
-It is important to understand that the data growth causes compaction storms, and not the manual split decision.</p>
+Otherwise, the cluster can be prone to compaction storms with a large number of regions under compaction at the same time.
+It is important to understand that it is the data growth, not the manual split decision, that causes compaction storms.</p>
 </div>
 <div class="paragraph">
 <p>If the regions are split into too many large regions, you can increase the major compaction interval by configuring <code>HConstants.MAJOR_COMPACTION_PERIOD</code>.
@@ -5431,7 +5430,7 @@ HBase 0.90 introduced <code>org.apache.hadoop.hbase.util.RegionSplitter</code>,
 </p>
 </div>
 </div>
 <div class="sect3">
-<h4 id="managed.compactions"><a class="anchor" href="#managed.compactions"></a>9.2.8. Managed Compactions</h4>
+<h4 id="managed.compactions"><a class="anchor" href="#managed.compactions"></a>9.2.7. Managed Compactions</h4>
 <div class="paragraph">
 <p>By default, major compactions are scheduled to run once in a 7-day period.
 Prior to HBase 0.96.x, major compactions were scheduled to happen once per day by default.</p>
@@ -5462,7 +5461,7 @@ You can run major compactions manually via the HBase shell or via the <a href="h
 </div>
 </div>
 <div class="sect3">
-<h4 id="spec.ex"><a class="anchor" href="#spec.ex"></a>9.2.9. Speculative Execution</h4>
+<h4 id="spec.ex"><a class="anchor" href="#spec.ex"></a>9.2.8. Speculative Execution</h4>
 <div class="paragraph">
 <p>Speculative Execution of MapReduce tasks is on by default, and for HBase clusters it is generally advised to turn off Speculative Execution at a system-level unless you need it for a specific case, where it can be configured per-job.
 Set the properties <code>mapreduce.map.speculative</code> and <code>mapreduce.reduce.speculative</code> to false.</p>
 </div>
@@ -5503,9 +5502,9 @@ You might also see the graphs on the tail of <a href="https://issues.apache.org/
 See the Deveraj Das and Nicolas Liochon blog post <a href="http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/">Introduction to HBase Mean Time to Recover (MTTR)</a> for a brief introduction.</p>
 </div>
 <div class="paragraph">
-<p>The issue <a href="https://issues.apache.org/jira/browse/HBASE-8389">HBASE-8354 forces Namenode into loop with lease recovery requests</a> is messy but has a bunch of good discussion toward the end on low timeouts and how to effect faster recovery including citation of fixes added to HDFS. Read the Varun Sharma comments.
+<p>The issue <a href="https://issues.apache.org/jira/browse/HBASE-8389">HBASE-8354 forces Namenode into loop with lease recovery requests</a> is messy but has a bunch of good discussion toward the end on low timeouts and how to achieve faster recovery, including citation of fixes added to HDFS. Read the Varun Sharma comments.
 The below suggested configurations are Varun’s suggestions distilled and tested.
-Make sure you are running on a late-version HDFS so you have the fixes he refers too and himself adds to HDFS that help HBase MTTR (e.g.
+Make sure you are running on a late-version HDFS so you have the fixes he refers to, and himself adds to HDFS, that help HBase MTTR (e.g.
 HDFS-3703, HDFS-3712, and HDFS-4791 — Hadoop 2 for sure has them and late Hadoop 1 has some).
 Set the following in the RegionServer.</p>
 </div>
 <div class="listingblock">
@@ -20880,7 +20879,7 @@ See <a href="#block.cache">Block Cache</a></p>
 <div class="sect2">
 <h3 id="perf.handlers"><a class="anchor" href="#perf.handlers"></a>96.3. <code>hbase.regionserver.handler.count</code></h3>
 <div class="paragraph">
-<p>See <a href="#hbase.regionserver.handler.count">[hbase.regionserver.handler.count]</a>.</p>
+<p>See <a href="#hbase.regionserver.handler.count"><code>hbase.regionserver.handler.count</code></a>.</p>
 </div>
 </div>
 <div class="sect2">
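As a quick cross-check of the properties the hunks above touch, here is a minimal hbase-site.xml sketch for a fully distributed setup. The hostnames and the timeout value are illustrative assumptions only, not recommendations:

<configuration>
  <!-- Point HBase at an HDFS instance; namenode.example.org is a hypothetical host. -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
  </property>
  <!-- Note the spelling fixed above: hbase.cluster.distributed, not hbase-cluster.distributed. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- The ZooKeeper ensemble clients bootstrap from; hosts reuse the example1 through example9 naming above. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
  </property>
  <!-- Tuned down from the three-minute default (milliseconds) so the Master notices failures sooner;
       only safe once JVM garbage collection pauses are under control. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
</configuration>

Per the "Keep Configuration In Sync Across the Cluster" callout above, copy the edited file to every node yourself (HBase will not do it for you) and restart the servers so the change is picked up.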
http://git-wip-us.apache.org/repos/asf/hbase-site/blob/9fb0764b/bulk-loads.html
----------------------------------------------------------------------
diff --git a/bulk-loads.html b/bulk-loads.html
index 298f611..f619129 100644
--- a/bulk-loads.html
+++ b/bulk-loads.html
@@ -7,7 +7,7 @@
 <head>
 <meta charset="UTF-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
- <meta name="Date-Revision-yyyymmdd" content="20170707" />
+ <meta name="Date-Revision-yyyymmdd" content="20170708" />
 <meta http-equiv="Content-Language" content="en" />
 <title>Apache HBase – Bulk Loads in Apache HBase (TM)
@@ -311,7 +311,7 @@ under the License. -->
 <a href="https://www.apache.org/">The Apache Software Foundation</a>. All rights reserved.
- <li id="publishDate" class="pull-right">Last Published: 2017-07-07</li>
+ <li id="publishDate" class="pull-right">Last Published: 2017-07-08</li>
 </p>
 </div>