Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hbase/Troubleshooting" page has been changed by AndrewPurtell.
http://wiki.apache.org/hadoop/Hbase/Troubleshooting?action=diff&rev1=34&rev2=35

--------------------------------------------------

   1. [[#16|Problem: Scanner performance is low]]
   1. [[#17|Problem: My shell or client application throws lots of scary 
exceptions during normal operation]]
  
+ 
  <<Anchor(1)>>
  == 1. Problem: Master initializes, but Region Servers do not ==
   * Master's log contains repeated instances of the following block:
@@ -50, +51 @@

  127.0.0.1             localhost.localdomain localhost
  ::1           localhost6.localdomain6 localhost6
  }}}
+ 
+ 
  <<Anchor(2)>> 
  == 2. Problem: Created Root Directory for HBase through Hadoop DFS ==
   * On startup, the Master says you need to run the HBase migrations script. Upon running it, the script reports that there are no files in the root directory.
@@ -57, +60 @@

   * HBase expects the root directory either not to exist, or to have already been initialized by a previous run of HBase. If you create a new directory for HBase using Hadoop DFS, this error will occur.
  === Resolution ===
   * Make sure the HBase root directory either does not exist or was initialized by a previous run of HBase. A sure-fire solution is to use Hadoop DFS to delete the HBase root and let HBase create and initialize the directory itself.
+ 
  
  <<Anchor(3)>>
  == 3. Problem: Replay of hlog required, forcing regionserver restart ==
@@ -77, +81 @@

  === Causes ===
   * RPC timeouts may happen because of I/O contention that blocks processes during swapping.
  === Resolution ===
+  * Configure your system to avoid swapping. Set vm.swappiness to 0. 
(http://kerneltrap.org/node/1044)
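+  A minimal sketch of how to apply this on a typical Linux host (assumes root privileges):
+  {{{
+  # take effect immediately
+  sysctl -w vm.swappiness=0
+  # persist across reboots
+  echo "vm.swappiness = 0" >> /etc/sysctl.conf
+  }}}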
+ 
  
  <<Anchor(4)>>
  == 4. Problem: On migration, no files in root directory ==
@@ -86, +92 @@

  === Resolution ===
   * Make sure the HBase root directory either does not exist or was initialized by a previous run of HBase. A sure-fire solution is to use Hadoop DFS to delete the HBase root and let HBase create and initialize the directory itself.
  
- == Problem: lots of DFS error regarding can not find block from live nodes ==
-  * Under heavy read load, you may see lots of DFSClient complains about no 
live nodes hold a particular block. And, hadoop DataNode logs show xceiverCount 
exceed the limit (256)
- 
- === Causes ===
-  * not enough xceiver thread on DataNode to serve the traffic
- 
- === Resolution ===
-  * Either reduce the load or set dfs.datanode.max.xcievers (hadoop-site.xml) 
to a larger value than the default (256). Note that in order to change the 
tunable, you need 0.17.2 or 0.18.0 (HADOOP-3859).
  
  <<Anchor(5)>>
  == 5. Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 
256" ==
@@ -101, +99 @@

  === Causes ===
   * An upper bound on connections was added in Hadoop 
(HADOOP-3633/HADOOP-3859).
  === Resolution ===
-  * Up the maximum by setting '''dfs.datanode.max.xcievers''' (sic).  See 
[[http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/%[email protected]%3e|message
 from jean-adrien]] for some background.
+  * Up the maximum by setting '''dfs.datanode.max.xcievers''' (sic).  See 
[[http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/%[email protected]%3e|message
 from jean-adrien]] for some background. Values of 2048 or 4096 are common. 
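+  A sketch of the corresponding entry in hadoop-site.xml (hdfs-site.xml on newer Hadoop releases); 2048 is just one of the common values mentioned above:
+  {{{
+  <!-- dfs.datanode.max.xcievers: illustrative value, tune for your read load -->
+  <property>
+    <name>dfs.datanode.max.xcievers</name>
+    <value>2048</value>
+  </property>
+  }}}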
  
  
  <<Anchor(6)>>
@@ -133, +131 @@

  2009-02-24 10:01:36,472 WARN 
org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master 
for xxx milliseconds - retrying
  }}}
  === Causes ===
-  * Excessive connection establishment latency (HRPC sets up connections on 
demand)
   * Slow host name resolution
+  * Very long garbage collection pauses in the RegionServer JVM: the ''default garbage collector'' of the HotSpot(TM) Java(TM) Virtual Machine runs a ''full GC'' on the ''old generation'' when that space fills up, which can represent about 90% of the allocated heap. During this pause the running program, and its timers, are stopped. If the heap is mostly in the swap partition, and especially if it is larger than physical memory, the resulting swapping can cause I/O overload and the pause can take several minutes.
   * Network bandwidth overcommitment
-  * Very long garbage collector task for the RegionServer JVM: The ''default 
garbage collector'' of the HotspotTM JavaTM Virtual Machine runs a ''full gc'' 
on the ''old space'' when the memory space is full, which can represent about 
90% of the allocated heap space. During this task, the running program is 
stopped, the timers as well. If the heap space is mostly in the swap partition, 
and moreover if it is larger than the physical memory, the subsequent swap can 
yield to I/O overload and takes several minutes.
  === Resolution ===
   * Ensure that host name resolution latency is low, or use static entries in /etc/hosts
   * Monitor the network and ensure that adequate bandwidth is available for HRPC transactions
@@ -146, +143 @@

   * For Java SE 6, some users have had success with {{{ -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8 }}} (see the sketch after this list).
   * See HBase [[PerformanceTuning|Performance Tuning]] for more on JVM GC 
tuning.
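  A sketch of how those flags might be applied cluster-wide via conf/hbase-env.sh (HBASE_OPTS is assumed as the hook here; append rather than overwrite):
  {{{
  # conf/hbase-env.sh -- add the CMS flags suggested above to the JVM options
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8"
  }}}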
  
+ 
  <<Anchor(8)>>
  == 8. Problem: Instability on Amazon EC2 ==
   * Various problems suggesting overloading on Amazon EC2 deployments: Scanner 
timeouts, problems locating HDFS blocks, missed heartbeats, "We slept xxx ms, 
ten times longer than scheduled" messages, and so on. 
@@ -158, +156 @@

   * Use X-Large (c1.xlarge) instances
   * Consider splitting storage and computational function over disjoint 
instance sets. 
  
+ 
  <<Anchor(9)>>
  == 9. Problem: ZooKeeper SessionExpired events ==
   * Master or RegionServers reinitialize their ZooKeeper wrappers after 
receiving SessionExpired events.
@@ -174, +173 @@

  }}}
   * For Java SE 6, some users have had success with {{{ 
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8 }}}.  
See HBase [[PerformanceTuning|Performance Tuning]] for more on JVM GC tuning.
  
+ 
  <<Anchor(10)>>
  == 10. Problem: Scanners keep getting timeouts ==
   * Client receives org.apache.hadoop.hbase.UnknownScannerException, or times out, even if the region server lease is set very high. Fixed in HBase 0.20.0.
@@ -182, +182 @@

  === Resolution ===
   * Set hbase.client.scanner.caching in hbase-site.xml to a very low value such as 1, or use HTable.setScannerCaching(1).
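  A minimal sketch of the client-side call, assuming the client API of that era (the table name is hypothetical):
  {{{
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;

  public class SlowScannerSetup {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "mytable");  // hypothetical table name
      // fetch one row per next() call so the scanner lease is renewed on every round trip
      table.setScannerCaching(1);
    }
  }
  }}}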
  
+ 
  <<Anchor(11)>>
  == 11. Problem: Client says no such table but it exists ==
   * Client can't find region in table, says no such table.
@@ -199, +200 @@

  === Resolution ===
   * Use the hostname presented in the error message instead of the value you 
used. If you have a DNS server, you can set '''hbase.zookeeper.dns.interface''' 
and '''hbase.zookeeper.dns.nameserver''' in hbase-site.xml to make sure it 
resolves to the correct FQDN.
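  A sketch of the corresponding hbase-site.xml entries; the interface and nameserver shown are placeholders for your environment:
  {{{
  <property>
    <name>hbase.zookeeper.dns.interface</name>
    <value>eth0</value>
  </property>
  <property>
    <name>hbase.zookeeper.dns.nameserver</name>
    <value>ns1.example.com</value>
  </property>
  }}}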
  
+ 
  <<Anchor(13)>>
- == 13. Problem: Long client pauses under high load; or deadlock if using 
THBase ==
+ == 13. Problem: Long client pauses under high load; or deadlock if using transactional HBase (THBase) ==
   * Under high load, some client operations take a long time; waiting appears 
uneven
-  * If using transactional HBase (THBase), apparent deadlocks: for example, in 
thread dumps IPC Server handlers are blocked in 
org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion.updateIndex()
+  * If using THBase, apparent deadlocks: for example, in thread dumps IPC 
Server handlers are blocked in 
org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion.updateIndex()
  == Causes ==
   * The default number of regionserver RPC handlers is insufficient.
  == Resolution ==
   * Increase the value of "hbase.regionserver.handler.count" in 
hbase-site.xml. The default is 10. Try 100.
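  For example, in hbase-site.xml (100 is the starting point suggested above; tune for your workload):
  {{{
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
  </property>
  }}}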
+ 
  
  <<Anchor(14)>>
  == 14. Problem: Zookeeper does not seem to work on Amazon EC2 ==
@@ -224, +227 @@

  == Resolution ==
   * Use the internal EC2 host names when configuring the Zookeeper quorum peer 
list. 
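  For example, in hbase-site.xml (the internal host names below are placeholders):
  {{{
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-0-0-1.ec2.internal,ip-10-0-0-2.ec2.internal,ip-10-0-0-3.ec2.internal</value>
  </property>
  }}}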
  
+ 
  <<Anchor(15)>>
  == 15. Problem: General operating environment issues -- zookeeper session 
timeouts, regionservers shutting down, etc ==
  == Causes ==
@@ -231, +235 @@

  == Resolution ==
  See the [[http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting|ZooKeeper Operating Environment Troubleshooting]] page.  It has suggestions and tools for checking disk and networking performance, i.e. the operating environment your ZooKeeper and HBase are running in.  ZooKeeper is the cluster's "canary": it will be the first to notice issues, if any, so making sure it is happy is the shortcut to a humming cluster.
  
+ 
  <<Anchor(16)>>
  == 16. Problem: Scanner performance is low ==
  == Causes ==
@@ -239, +244 @@

   * Increase the amount of prefetching on the scanner, to 10, or 100, or 1000, 
as appropriate for your workload: 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching|HTable.scannerCaching]]
   * This change can be accomplished globally by setting the 
hbase.client.scanner.caching property in hbase-site.xml to the desired value.
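  For example, a hypothetical global default of 100 in hbase-site.xml:
  {{{
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>100</value>
  </property>
  }}}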
  
+ 
  <<Anchor(17)>>
  == 17. Problem: My shell or client application throws lots of scary 
exceptions during normal operation ==
  == Causes ==
