Hello,
I am having issues of regionservers shutting down because the underlying DFS
is having trouble. I cannot get the DFS to function properly. I attached a
log of our regionserver below. At first sight it looks like there is a
timeout that occurs while the regionservers was trying to connect to it's
own datanode:
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/172.23.122.23:39203 remote=/172.23.122.23:50010]
The datanode logs show similar timeout errors (also when trying to connect
to local host).
Our configuration:
* Hadoop 0.20.1 and Hbase 0.20.2
* 20 nodes: each datanode/tasktracker and regionserver
* ulimit = 32000 on each node
* dfs.datanode.max.xcievers=5000 (should be big enough i presume)
* dfs.namenode.handler.count=30 (3 times the default)
* dfs.datanode.handler.count=9 (3 times the default)
* Our DFS contain around 500.000 blocks (64MB blocksize). Each datanode has
1600M heapsize. The datanodes are started with the -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode options.
* The DFS is fairly heavily used. (Always running a few concurrent MapRed
jobs, with average IO)
Anyone a clue what might be going on?
Regards,
Ferdy.
THE REGIONSERVER LOG:
2010-01-20 13:52:15,982 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
started. Attempting to free 22928768 bytes
2010-01-20 13:52:15,988 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
completed. Freed 22985616 bytes. Priority Sizes: Single=84.519615MB
(88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0)
2010-01-20 13:53:03,613 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=169.96802MB (178224384), Free=48.36949MB (50719088), Max=218.33751MB
(228943472), Counts: Blocks=2683, Access=184190, Hit=10090, Miss=174100,
Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.478039011359215%, Miss
Ratio=94.521963596344%, Evicted/Run=349.0
2010-01-20 13:54:03,605 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=174.83826MB (183331208), Free=43.499245MB (45612264), Max=218.33751MB
(228943472), Counts: Blocks=2762, Access=184477, Hit=10298, Miss=174179,
Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.582267791032791%, Miss
Ratio=94.41773295402527%, Evicted/Run=349.0
2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception for block
blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203
remote=/172.23.122.23:50010]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)
2010-01-20 13:55:45,744 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=178.23663MB (186894656), Free=40.100876MB (42048816), Max=218.33751MB
(228943472), Counts: Blocks=2816, Access=184676, Hit=10443, Miss=174233,
Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.654768273234367%, Miss
Ratio=94.34522986412048%, Evicted/Run=349.0
2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@2f242b11
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
66554ms, ten times longer than scheduled: 3000
2010-01-20 13:55:45,911 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2010-01-20 13:55:46,178 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
at java.lang.Thread.run(Unknown Source)
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-209323108721490976_6065406 bad datanode[0]
172.23.122.23:50010
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-209323108721490976_6065406 in pipeline
172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad datanode
172.23.122.23:50010
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
Server handler 3 on 60020 took 71023ms appending an edit to hlog;
editcount=41251
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
regionserver/172.23.122.23:60020.logFlusher took 67200ms optional sync'ing
hlog; editcount=41271
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
Server handler 1 on 60020 took 69781ms appending an edit to hlog;
editcount=41271
2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server c1.machines.nl/172.23.122.1:2181
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181]
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server
connection successful
2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@5f84f3d2
java.io.IOException: Session Expired
at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-20 13:55:47,086 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Expired, type: None, path: null
2010-01-20 13:55:47,086 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
expired