I see more than one problem here.

> DFSOutputStream ResponseProcessor exception  for block
> blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000

As you said, your HDFS is getting hit hard, and this timeout is a sign
of it. Some people set dfs.datanode.socket.write.timeout and
dfs.socket.timeout (the client read timeout) to very high values in
that case (like more than 10 minutes). If you want to serve live data
from HBase that's really not the best approach though; consider either
separating processing and serving into 2 clusters OR lowering the
number of tasks on the nodes.
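
If you do go the high-timeouts route, it would look something like
this in hdfs-site.xml. The values are only examples and these are the
0.20-era property names, so double-check them against your version.
(The 69000 ms in your log looks like the 60 s default read timeout
plus a small per-datanode extension for the 3-node write pipeline.)

  <!-- hdfs-site.xml: raise the HDFS socket timeouts (milliseconds) -->
  <property>
    <!-- how long a writer waits on the pipeline; default 480000 (8 min) -->
    <name>dfs.datanode.socket.write.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <!-- read timeout used by clients and datanodes; default 60000 (1 min) -->
    <name>dfs.socket.timeout</name>
    <value>600000</value>
  </property>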

> java.io.IOException: TIMED OUT
>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 66554ms, ten times longer than scheduled: 3000

This means your region server was garbage collecting for more than a
minute, which is longer than the default time after which a region
server is considered dead. This led to:

> java.io.IOException: Session Expired
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)

This is how we notice a server has gone missing: each region server
holds an ephemeral node in ZooKeeper, which it loses after 40 secs in
0.20.2 (and earlier) and after 60 secs in 0.20.3.
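
If you can't eliminate the GC pauses entirely, you can also ask for a
longer session. A sketch for hbase-site.xml, assuming the 0.20-era
zookeeper.session.timeout property and an HBase-managed ZooKeeper
quorum (values in milliseconds):

  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <!-- ZooKeeper caps sessions at 20 * tickTime, which is where the
       40 secs (20 * 2s) in 0.20.2 comes from; raise the tick if you
       want to go above 60 secs -->
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>3000</value>
  </property>

Keep in mind a longer session also means a dead server takes longer
to detect, so don't push it too far.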

Consider giving at least 4GB of heap to the region servers and make
sure you never swap.
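
For example, in hbase-env.sh (the GC flags here are a starting point,
not gospel, and the log path is just an example):

  # Give each region server a bigger heap (value in MB).
  export HBASE_HEAPSIZE=4000

  # CMS keeps pauses shorter than the default collector on big heaps;
  # the GC log will show you exactly how long the pauses really are.
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -Xloggc:/var/log/hbase/gc-regionserver.log"

And on each node, keep the kernel from swapping the JVM out, e.g.
with sysctl -w vm.swappiness=0.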

J-D

On Wed, Jan 20, 2010 at 8:22 AM, Ferdy <ferdy.gal...@kalooga.com> wrote:
> Hello,
>
> I am having issues with regionservers shutting down because the underlying
> DFS is having trouble. I cannot get the DFS to function properly. I attached
> a log of our regionserver below. At first sight it looks like there is a
> timeout that occurs while the regionserver was trying to connect to its
> own datanode:
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/172.23.122.23:39203 remote=/172.23.122.23:50010]
>
> The datanode logs show similar timeout errors (also when trying to connect
> to the local host).
>
> Our configuration:
> * Hadoop 0.20.1 and Hbase 0.20.2
> * 20 nodes: each datanode/tasktracker and regionserver
> * ulimit = 32000 on each node
> * dfs.datanode.max.xcievers=5000 (should be big enough I presume)
> * dfs.namenode.handler.count=30 (3 times the default)
> * dfs.datanode.handler.count=9 (3 times the default)
> * Our DFS contains around 500,000 blocks (64MB blocksize). Each datanode has
> a 1600MB heap. The datanodes are started with the -XX:+UseConcMarkSweepGC
> -XX:+CMSIncrementalMode options.
> * The DFS is fairly heavily used (always running a few concurrent MapReduce
> jobs with average IO).
>
> Anyone have a clue what might be going on?
>
> Regards,
> Ferdy.
>
> THE REGIONSERVER LOG:
> 2010-01-20 13:52:15,982 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started.  Attempting to free 22928768 bytes
> 2010-01-20 13:52:15,988 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed. Freed 22985616 bytes.  Priority Sizes: Single=84.519615MB
> (88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0)
> 2010-01-20 13:53:03,613 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=169.96802MB (178224384), Free=48.36949MB (50719088), Max=218.33751MB
> (228943472), Counts: Blocks=2683, Access=184190, Hit=10090, Miss=174100,
> Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.478039011359215%, Miss
> Ratio=94.521963596344%, Evicted/Run=349.0
> 2010-01-20 13:54:03,605 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=174.83826MB (183331208), Free=43.499245MB (45612264), Max=218.33751MB
> (228943472), Counts: Blocks=2762, Access=184477, Hit=10298, Miss=174179,
> Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.582267791032791%, Miss
> Ratio=94.41773295402527%, Evicted/Run=349.0
> 2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203
> remote=/172.23.122.23:50010]
>       at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>       at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>       at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>       at java.io.DataInputStream.readFully(Unknown Source)
>       at java.io.DataInputStream.readLong(Unknown Source)
>       at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)
>
> 2010-01-20 13:55:45,744 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=178.23663MB (186894656), Free=40.100876MB (42048816), Max=218.33751MB
> (228943472), Counts: Blocks=2816, Access=184676, Hit=10443, Miss=174233,
> Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.654768273234367%, Miss
> Ratio=94.34522986412048%, Evicted/Run=349.0
> 2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@2f242b11
> java.io.IOException: TIMED OUT
>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 66554ms, ten times longer than scheduled: 3000
> 2010-01-20 13:55:45,911 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2010-01-20 13:55:46,178 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>       at java.lang.reflect.Constructor.newInstance(Unknown Source)
>       at
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>       at
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>       at
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>       at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>       at java.lang.Thread.run(Unknown Source)
> 2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-209323108721490976_6065406 bad datanode[0]
> 172.23.122.23:50010
> 2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-209323108721490976_6065406 in pipeline
> 172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad datanode
> 172.23.122.23:50010
> 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
> Server handler 3 on 60020 took 71023ms appending an edit to hlog;
> editcount=41251
> 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
> regionserver/172.23.122.23:60020.logFlusher took 67200ms optional sync'ing
> hlog; editcount=41271
> 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
> Server handler 1 on 60020 took 69781ms appending an edit to hlog;
> editcount=41271
> 2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server c1.machines.nl/172.23.122.1:2181
> 2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected
> local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181]
> 2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server
> connection successful
> 2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@5f84f3d2
> java.io.IOException: Session Expired
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-01-20 13:55:47,086 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2010-01-20 13:55:47,086 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
>
>
