Hi,

Increasing the memory did not work. Is it possible to increase the timeout period? All our jobs are offline processing jobs, so I'd rather accept lower responsiveness than have ZooKeeper decide to terminate a regionserver.
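
If it simply comes down to raising the session timeout, is something like
this in hbase-site.xml the right knob? (I'm guessing at the property name
and the value here, so please correct me if I'm off.)

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- guess on my part: 2 minutes, in milliseconds; I understand ZooKeeper
         may also cap this at 20x its tick time, so that might need raising too -->
    <value>120000</value>
  </property>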

I also noticed that regionserver crashes happen even when there is no load at all on HBase; in other words, they seem to occur randomly.

Ferdy

Jean-Daniel Cryans wrote:
I see more than one problem here.

DFSOutputStream ResponseProcessor exception  for block
blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000

As you said, your HDFS is getting hit a lot and this is a sign of it.
Some people set dfs.datanode.socket.write.timeout and
dfs.datanode.socket.read.timeout to very high values in that case
(like more than 10 minutes). If you want to serve live data from HBase
that's really not ideal, and you should consider separating processing
and serving into 2 clusters OR lowering the number of tasks on the nodes.
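
Concretely, that would look something like this in hdfs-site.xml on the
datanodes (the values are just an example, and depending on your Hadoop
version the read timeout may instead be called dfs.socket.timeout, so
double-check the property names):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- example only: 10 minutes, in milliseconds -->
    <value>600000</value>
  </property>
  <property>
    <name>dfs.datanode.socket.read.timeout</name>
    <!-- example only: some versions use dfs.socket.timeout for this -->
    <value>600000</value>
  </property>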

java.io.IOException: TIMED OUT
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
66554ms, ten times longer than scheduled: 3000

This means your region server was garbage collecting for more than a
minute, which is longer than the default time after which a region
server is considered dead. This led to:

java.io.IOException: Session Expired
      at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)

This is how we notice a server has gone missing: each region server has
an ephemeral node in ZooKeeper, and it is lost after 40 secs in
0.20.2 (and previous) and after 60 secs in 0.20.3.

Consider giving at least 4GB of heap to the region servers and make
sure you never swap.
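
As a rough sketch of what that could look like in hbase-env.sh (the
numbers are only a starting point, adjust them for your hardware):

  # give the region servers a bigger heap (value is in MB, so this is ~4GB)
  export HBASE_HEAPSIZE=4000
  # optional: log GC activity so the long pauses show up in a file
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails"
  export HBASE_OPTS="$HBASE_OPTS -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc.log"

On the OS side, keeping vm.swappiness low (e.g. sysctl -w vm.swappiness=0)
helps make sure the heap never gets swapped out.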

J-D

On Wed, Jan 20, 2010 at 8:22 AM, Ferdy <ferdy.gal...@kalooga.com> wrote:
Hello,

I am having issues with regionservers shutting down because the underlying DFS
is having trouble. I cannot get the DFS to function properly. I attached a
log of one of our regionservers below. At first sight it looks like there is a
timeout that occurs while the regionserver was trying to connect to its
own datanode:
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/172.23.122.23:39203 remote=/172.23.122.23:50010]

The datanode logs show similar timeout errors (also when connecting to the
local host).

Our configuration:
* Hadoop 0.20.1 and HBase 0.20.2
* 20 nodes, each running a datanode, a tasktracker and a regionserver
* ulimit = 32000 on each node
* dfs.datanode.max.xcievers=5000 (should be big enough, I presume)
* dfs.namenode.handler.count=30 (3 times the default)
* dfs.datanode.handler.count=9 (3 times the default)
* Our DFS contains around 500,000 blocks (64MB blocksize). Each datanode has
a 1600M heapsize. The datanodes are started with the -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode options.
* The DFS is fairly heavily used (always running a few concurrent MapRed
jobs with average IO).

Anyone have a clue what might be going on?

Regards,
Ferdy.

THE REGIONSERVER LOG:
2010-01-20 13:52:15,982 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
started.  Attempting to free 22928768 bytes
2010-01-20 13:52:15,988 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
completed. Freed 22985616 bytes.  Priority Sizes: Single=84.519615MB
(88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0)
2010-01-20 13:53:03,613 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=169.96802MB (178224384), Free=48.36949MB (50719088), Max=218.33751MB
(228943472), Counts: Blocks=2683, Access=184190, Hit=10090, Miss=174100,
Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.478039011359215%, Miss
Ratio=94.521963596344%, Evicted/Run=349.0
2010-01-20 13:54:03,605 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=174.83826MB (183331208), Free=43.499245MB (45612264), Max=218.33751MB
(228943472), Counts: Blocks=2762, Access=184477, Hit=10298, Miss=174179,
Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.582267791032791%, Miss
Ratio=94.41773295402527%, Evicted/Run=349.0
2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203
remote=/172.23.122.23:50010]
      at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
      at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
      at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
      at java.io.DataInputStream.readFully(Unknown Source)
      at java.io.DataInputStream.readLong(Unknown Source)
      at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)

2010-01-20 13:55:45,744 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=178.23663MB (186894656), Free=40.100876MB (42048816), Max=218.33751MB
(228943472), Counts: Blocks=2816, Access=184676, Hit=10443, Miss=174233,
Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.654768273234367%, Miss
Ratio=94.34522986412048%, Evicted/Run=349.0
2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@2f242b11
java.io.IOException: TIMED OUT
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
66554ms, ten times longer than scheduled: 3000
2010-01-20 13:55:45,911 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2010-01-20 13:55:46,178 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
Source)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
Source)
      at java.lang.reflect.Constructor.newInstance(Unknown Source)
      at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
      at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
      at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
      at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
      at java.lang.Thread.run(Unknown Source)
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-209323108721490976_6065406 bad datanode[0]
172.23.122.23:50010
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-209323108721490976_6065406 in pipeline
172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad datanode
172.23.122.23:50010
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
Server handler 3 on 60020 took 71023ms appending an edit to hlog;
editcount=41251
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
regionserver/172.23.122.23:60020.logFlusher took 67200ms optional sync'ing
hlog; editcount=41271
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
Server handler 1 on 60020 took 69781ms appending an edit to hlog;
editcount=41271
2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server c1.machines.nl/172.23.122.1:2181
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181]
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server
connection successful
2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@5f84f3d2
java.io.IOException: Session Expired
      at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
      at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-20 13:55:47,086 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Expired, type: None, path: null
2010-01-20 13:55:47,086 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
expired


