Hello,
I am having issues of regionservers shutting down because the underlying
DFS is having trouble. I cannot get the DFS to function properly. I
attached a log of our regionserver below. At first sight it looks like
there is a timeout that occurs while the regionservers was trying to
connect to it's own datanode:
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203
remote=/172.23.122.23:50010]
The datanode logs show similar timeout errors (also when trying to
connect to local host).
Our configuration:
* Hadoop 0.20.1 and Hbase 0.20.2
* 20 nodes: each datanode/tasktracker and regionserver
* ulimit = 32000 on each node
* dfs.datanode.max.xcievers=5000 (should be big enough i presume)
* dfs.namenode.handler.count=30 (3 times the default)
* dfs.datanode.handler.count=9 (3 times the default)
* Our DFS contain around 500.000 blocks (64MB blocksize). Each datanode
has 1600M heapsize. The datanodes are started with the
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode options.
* The DFS is fairly heavily used. (Always running a few concurrent
MapRed jobs, with average IO)
Anyone a clue what might be going on?
Regards,
Ferdy.
THE REGIONSERVER LOG:
2010-01-20 13:52:15,982 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
started. Attempting to free 22928768 bytes
2010-01-20 13:52:15,988 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
completed. Freed 22985616 bytes. Priority Sizes: Single=84.519615MB
(88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0)
2010-01-20 13:53:03,613 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=169.96802MB (178224384), Free=48.36949MB (50719088),
Max=218.33751MB (228943472), Counts: Blocks=2683, Access=184190,
Hit=10090, Miss=174100, Evictions=7, Evicted=2443, Ratios: Hit
Ratio=5.478039011359215%, Miss Ratio=94.521963596344%, Evicted/Run=349.0
2010-01-20 13:54:03,605 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=174.83826MB (183331208), Free=43.499245MB (45612264),
Max=218.33751MB (228943472), Counts: Blocks=2762, Access=184477,
Hit=10298, Miss=174179, Evictions=7, Evicted=2443, Ratios: Hit
Ratio=5.582267791032791%, Miss Ratio=94.41773295402527%, Evicted/Run=349.0
2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception for block
blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203
remote=/172.23.122.23:50010]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)
2010-01-20 13:55:45,744 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=178.23663MB (186894656), Free=40.100876MB (42048816),
Max=218.33751MB (228943472), Counts: Blocks=2816, Access=184676,
Hit=10443, Miss=174233, Evictions=7, Evicted=2443, Ratios: Hit
Ratio=5.654768273234367%, Miss Ratio=94.34522986412048%, Evicted/Run=349.0
2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@2f242b11
java.io.IOException: TIMED OUT
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 66554ms, ten times longer than scheduled: 3000
2010-01-20 13:55:45,911 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2010-01-20 13:55:46,178 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
at java.lang.Thread.run(Unknown Source)
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-209323108721490976_6065406 bad datanode[0]
172.23.122.23:50010
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-209323108721490976_6065406 in pipeline
172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad
datanode 172.23.122.23:50010
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 3 on 60020 took 71023ms appending an edit to hlog;
editcount=41251
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
regionserver/172.23.122.23:60020.logFlusher took 67200ms optional
sync'ing hlog; editcount=41271
2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 1 on 60020 took 69781ms appending an edit to hlog;
editcount=41271
2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server c1.machines.nl/172.23.122.1:2181
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181]
2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server
connection successful
2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@5f84f3d2
java.io.IOException: Session Expired
at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-01-20 13:55:47,086 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Expired, type: None, path: null
2010-01-20 13:55:47,086 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
expired