Author: stack
Date: Thu May 12 18:51:10 2011
New Revision: 1102420
URL: http://svn.apache.org/viewvc?rev=1102420&view=rev
Log:
HBASE-3868 book.xml / troubleshooting.xml - porting wiki Troubleshooting page
Modified:
hbase/trunk/src/docbkx/book.xml
hbase/trunk/src/docbkx/troubleshooting.xml
Modified: hbase/trunk/src/docbkx/book.xml
URL:
http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/book.xml?rev=1102420&r1=1102419&r2=1102420&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/book.xml (original)
+++ hbase/trunk/src/docbkx/book.xml Thu May 12 18:51:10 2011
@@ -1321,8 +1321,7 @@ false
<question><para>Are there other HBase FAQs?</para></question>
<answer>
<para>
- See the FAQ that is up on the wiki, <link
xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ">HBase Wiki FAQ</link>
- as well as the <link
xlink:href="http://wiki.apache.org/hadoop/Hbase/Troubleshooting">Troubleshooting</link>
page.
+ See the FAQ that is up on the wiki, <link
xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ">HBase Wiki FAQ</link>.
</para>
</answer>
</qandaentry>
Modified: hbase/trunk/src/docbkx/troubleshooting.xml
URL:
http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/troubleshooting.xml?rev=1102420&r1=1102419&r2=1102420&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/troubleshooting.xml (original)
+++ hbase/trunk/src/docbkx/troubleshooting.xml Thu May 12 18:51:10 2011
@@ -85,7 +85,7 @@
To help debug this, or to confirm it is happening, GC logging can be
turned on in the Java virtual machine.
</para>
<para>
- To enable, in hbase-env.sh add:
+ To enable, in <filename>hbase-env.sh</filename> add:
<programlisting>
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"
</programlisting>
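Once the flag above is in place, long pauses show up in the GC log. A minimal sketch for spotting them follows; the log path and sample line here are stand-ins for illustration, not output from a real cluster:

```shell
# Sketch: look for stop-the-world "Full GC" pauses in the log enabled above.
# LOG is a stand-in path; point it at gc-hbase.log in practice.
LOG=./gc-hbase.log
printf '1234.567: [Full GC 8604364K->120000K(9067824K), 12.3456780 secs]\n' > "$LOG"  # stand-in sample line
grep 'Full GC' "$LOG"   # each hit is a collection pause worth investigating
```

Pauses of many seconds in this output line up with the ZooKeeper session expirations discussed later in this chapter.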
@@ -406,17 +406,47 @@ hadoop 17789 155 35.2 9067824 8604364
<section xml:id="trouble.client.scantimeout">
<title>ScannerTimeoutException</title>
<para>This is thrown if the time between RPC calls from the client
to RegionServer exceeds the scan timeout.
- For example, if Scan.setCaching is set to 500, then there will be
an RPC call to fetch the next batch of rows every 500 <code>.next()</code>
calls on the ResultScanner
+ For example, if <code>Scan.setCaching</code> is set to 500, then
there will be an RPC call to fetch the next batch of rows every 500
<code>.next()</code> calls on the ResultScanner
because data is being transferred in blocks of 500 rows to the
client. Reducing the setCaching value may be an option, but setting this value
too low makes for inefficient
processing on large numbers of rows.
</para>
</section>
+ <section xml:id="trouble.client.scarylogs">
+ <title>Shell or client application throws lots of scary exceptions
during normal operation</title>
+        <para>Since 0.20.0, the default log level for
<code>org.apache.hadoop.hbase.*</code> is DEBUG. </para>
+ <para>
+ On your clients, edit
<filename>$HBASE_HOME/conf/log4j.properties</filename> and change this:
<code>log4j.logger.org.apache.hadoop.hbase=DEBUG</code> to this:
<code>log4j.logger.org.apache.hadoop.hbase=INFO</code>, or even
<code>log4j.logger.org.apache.hadoop.hbase=WARN</code>.
+ </para>
+ </section>
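One way to make the edit described above can be sketched as follows; it works on a throwaway stand-in file here, so point CONF at <filename>$HBASE_HOME/conf/log4j.properties</filename> on a real client:

```shell
# Sketch: flip the client log level from DEBUG to INFO.
# CONF is a throwaway stand-in; use $HBASE_HOME/conf/log4j.properties in practice.
CONF=./log4j.properties
printf 'log4j.logger.org.apache.hadoop.hbase=DEBUG\n' > "$CONF"   # stand-in for the real file
sed -i 's/^log4j\.logger\.org\.apache\.hadoop\.hbase=DEBUG$/log4j.logger.org.apache.hadoop.hbase=INFO/' "$CONF"
cat "$CONF"
```

Note that GNU sed's `-i` is assumed; on BSD/macOS the flag takes an argument (`sed -i ''`).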
</section>
<section xml:id="trouble.rs">
<title>RegionServer</title>
<section xml:id="trouble.rs.startup">
<title>Startup Errors</title>
+ <section xml:id="trouble.rs.startup.master-no-region">
+ <title>Master Starts, But RegionServers Do Not</title>
+          <para>The Master believes the RegionServers have the IP address
127.0.0.1 - which is localhost, and resolves to the Master's own localhost.
+ </para>
+ <para>The RegionServers are erroneously informing the Master that
their IP addresses are 127.0.0.1.
+ </para>
+ <para>Modify <filename>/etc/hosts</filename> on the region
servers, from...
+ <programlisting>
+# Do not remove the following line, or various programs
+# that require network functionality will fail.
+127.0.0.1 fully.qualified.regionservername regionservername
localhost.localdomain localhost
+::1 localhost6.localdomain6 localhost6
+ </programlisting>
+          ... to (removing the region server's own name from the localhost line)...
+ <programlisting>
+# Do not remove the following line, or various programs
+# that require network functionality will fail.
+127.0.0.1 localhost.localdomain localhost
+::1 localhost6.localdomain6 localhost6
+ </programlisting>
+ </para>
+ </section>
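A quick way to check for the misconfiguration described above on each region server is sketched below; `hostname -f` and `getent` behave slightly differently across distributions:

```shell
# Sketch: verify this host's name does not resolve back to 127.0.0.1.
NAME=$(hostname -f 2>/dev/null || hostname)
echo "name: $NAME"
# Should print the machine's real IP, not 127.0.0.1
# (|| true: ignore lookup failure on hosts without an FQDN configured)
getent hosts "$NAME" || true
```

If the second command prints a 127.0.0.1 line, the <filename>/etc/hosts</filename> fix above applies.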
+
<section xml:id="trouble.rs.startup.compression">
<title>Compression Link Errors</title>
<para>
@@ -453,7 +483,8 @@ java.lang.UnsatisfiedLinkError: no gplco
<section xml:id="trouble.rs.runtime.oom-nt">
<title>System instability, and the presence of
"java.lang.OutOfMemoryError: unable to create new native thread" exceptions in
HDFS DataNode logs or those of any system daemon</title>
<para>
- See the Getting Started section on <link linkend="ulimit">ulimit
and nproc configuration</link>.
+ See the Getting Started section on <link linkend="ulimit">ulimit
and nproc configuration</link>. The default on recent Linux
+ distributions is 1024 - which is far too low for HBase.
</para>
</section>
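The limits in effect for the user running the daemons can be checked directly; a sketch (the flag for the process limit varies between shells):

```shell
# Sketch: check the limits the HBase/DataNode user actually gets.
ulimit -n                               # max open file descriptors; a 1024 default is too low for HBase
ulimit -u 2>/dev/null || ulimit -p      # max user processes (nproc); flag varies by shell
```

Run this as the same user the daemons run as, since limits are per-user.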
<section xml:id="trouble.rs.runtime.gc">
@@ -477,6 +508,60 @@ java.lang.UnsatisfiedLinkError: no gplco
See the Getting Started section on <link linkend="ulimit">ulimit
and nproc configuration</link> and check your network.
</para>
</section>
+ <section xml:id="trouble.rs.runtime.zkexpired">
+ <title>ZooKeeper SessionExpired events</title>
+        <para>Master or RegionServers shut down with messages like the
following in their logs: </para>
+ <programlisting>
+WARN org.apache.zookeeper.ClientCnxn: Exception
+closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
+java.io.IOException: TIMED OUT
+ at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
+WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer
than scheduled: 5000
+INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server
hostname/IP:PORT
+INFO org.apache.zookeeper.ClientCnxn: Priming connection to
java.nio.channels.SocketChannel[connected local=/IP:PORT
remote=hostname/IP:PORT]
+INFO org.apache.zookeeper.ClientCnxn: Server connection successful
+WARN org.apache.zookeeper.ClientCnxn: Exception closing session
0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
+java.io.IOException: Session Expired
+ at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
+ at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
+ at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
+ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
expired
+ </programlisting>
+ <para>
+        The JVM is doing a long-running garbage collection which is pausing
every thread (aka "stop the world").
+        Because the RegionServer's local ZooKeeper client cannot send
heartbeats during the pause, the session times out.
+        By design, we shut down any node that cannot contact the ZooKeeper
ensemble within the timeout so that it stops serving data that may already
have been assigned elsewhere.
+ </para>
+ <para>
+ <itemizedlist>
+          <listitem>Make sure you give the RegionServer plenty of RAM (in
<filename>hbase-env.sh</filename>); the default of 1GB won't be able to sustain
long-running imports.</listitem>
+          <listitem>Make sure you don't swap; the JVM never behaves well
under swapping.</listitem>
+          <listitem>Make sure you are not CPU-starving the RegionServer
thread. For example, if you are running a MapReduce job using 6 CPU-intensive
tasks on a machine with 4 cores, you are probably starving the RegionServer
enough to create longer garbage collection pauses.</listitem>
+          <listitem>Increase the ZooKeeper session timeout.</listitem>
+ </itemizedlist>
+ If you wish to increase the session timeout, add the following to
your <filename>hbase-site.xml</filename> to increase the timeout from the
default of 60 seconds to 120 seconds.
+ <programlisting>
+<property>
+ <name>zookeeper.session.timeout</name>
+  <value>120000</value>
+</property>
+<property>
+ <name>hbase.zookeeper.property.tickTime</name>
+ <value>6000</value>
+</property>
+ </programlisting>
+ </para>
+ <para>
+        Be aware that setting a higher timeout means that the regions
served by a failed RegionServer will take at least
+        that amount of time to be transferred to another RegionServer. For a
production system serving live requests, we would instead
+        recommend setting it lower than 1 minute and over-provisioning your
cluster in order to lower the memory load on each machine (and hence have
+        less garbage to collect per machine).
+ </para>
+ <para>
+ If this is happening during an upload which only happens once (like
initially loading all your data into HBase), consider bulk loading.
+ </para>
+        <para>See <xref linkend="trouble.zookeeper.general"/> for other
general information about ZooKeeper troubleshooting.</para>
+ </section>
</section>
<section xml:id="trouble.rs.shutdown">
@@ -485,16 +570,74 @@ java.lang.UnsatisfiedLinkError: no gplco
</section>
</section>
+
<section xml:id="trouble.master">
<title>Master</title>
<section xml:id="trouble.master.startup">
<title>Startup Errors</title>
-
+ <section xml:id="trouble.master.startup.migration">
+ <title>Master says that you need to run the hbase migrations
script</title>
+ <para>Upon running that, the hbase migrations script says no
files in root directory.</para>
+          <para>HBase expects the root directory either to not exist, or to
have already been initialized by a previous run of HBase. If you create a new
directory for HBase using Hadoop DFS, this error will occur.
+          A sure-fire solution is to use Hadoop dfs to delete the HBase root
and let HBase create and initialize the directory itself.
+          </para>
+ </section>
+
</section>
- <section xml:id="trouble.master.startup">
+ <section xml:id="trouble.master.shutdown">
<title>Shutdown Errors</title>
</section>
</section>
+
+ <section xml:id="trouble.zookeeper">
+ <title>ZooKeeper</title>
+ <section xml:id="trouble.zookeeper.startup">
+ <title>Startup Errors</title>
+ <section xml:id="trouble.zookeeper.startup.address">
+ <title>Could not find my address: xyz in list of ZooKeeper quorum
servers</title>
+          <para>A ZooKeeper server wasn't able to start and throws this
error, where xyz is the name of your server.</para>
+ <para>This is a name lookup problem. HBase tries to start a
ZooKeeper server on some machine but that machine isn't able to find itself in
the <varname>hbase.zookeeper.quorum</varname> configuration.
+ </para>
+ <para>Use the hostname presented in the error message instead of
the value you used. If you have a DNS server, you can set
<varname>hbase.zookeeper.dns.interface</varname> and
<varname>hbase.zookeeper.dns.nameserver</varname> in
<filename>hbase-site.xml</filename> to make sure it resolves to the correct
FQDN.
+ </para>
+ </section>
+
+ </section>
+ <section xml:id="trouble.zookeeper.general">
+ <title>ZooKeeper, The Cluster Canary</title>
+      <para>ZooKeeper is the cluster's "canary in the mineshaft". It will be
the first to notice issues, if any arise, so making sure it's happy is the
shortcut to a humming cluster.
+ </para>
+ <para>
+ See the <link
xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting">ZooKeeper
Operating Environment Troubleshooting</link> page. It has suggestions and tools
for checking disk and networking performance; i.e. the operating environment
your ZooKeeper and HBase are running in.
+ </para>
+ </section>
+
+ </section>
+
+ <section xml:id="trouble.ec2">
+ <title>Amazon EC2</title>
+ <section xml:id="trouble.ec2.zookeeper">
+ <title>ZooKeeper does not seem to work on Amazon EC2</title>
+      <para>HBase does not start when deployed as Amazon EC2 instances.
Exceptions like the following appear in the Master and/or RegionServer logs:
</para>
+ <programlisting>
+ 2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting
+ connection to server
ec2-174-129-15-236.compute-1.amazonaws.com/10.244.9.171:2181
+ 2009-10-19 11:52:27,032 WARN org.apache.zookeeper.ClientCnxn: Exception
+ closing session 0x0 to sun.nio.ch.SelectionKeyImpl@656dc861
+ java.net.ConnectException: Connection refused
+ </programlisting>
+ <para>
+ Security group policy is blocking the ZooKeeper port on a public
address.
+ Use the internal EC2 host names when configuring the ZooKeeper
quorum peer list.
+ </para>
+ </section>
+ <section xml:id="trouble.ec2.instability">
+ <title>Instability on Amazon EC2</title>
+      <para>Questions on HBase and Amazon EC2 come up frequently on the
HBase dist-list. Search for old threads using <link
xlink:href="http://search-hadoop.com/">Search Hadoop</link>.
+ </para>
+ </section>
+ </section>
+
</chapter>