Hello,

I am having issues of regionservers shutting down because the underlying DFS is having trouble. I cannot get the DFS to function properly. I attached a log of our regionserver below. At first sight it looks like there is a timeout that occurs while the regionservers was trying to connect to it's own datanode: java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203 remote=/172.23.122.23:50010]

The datanode logs show similar timeout errors (also when trying to connect to local host).

Our configuration:
* Hadoop 0.20.1 and Hbase 0.20.2
* 20 nodes: each datanode/tasktracker and regionserver
* ulimit = 32000 on each node
* dfs.datanode.max.xcievers=5000 (should be big enough i presume)
* dfs.namenode.handler.count=30 (3 times the default)
* dfs.datanode.handler.count=9 (3 times the default)
* Our DFS contain around 500.000 blocks (64MB blocksize). Each datanode has 1600M heapsize. The datanodes are started with the -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode options. * The DFS is fairly heavily used. (Always running a few concurrent MapRed jobs, with average IO)

Anyone a clue what might be going on?

Regards,
Ferdy.

THE REGIONSERVER LOG:
2010-01-20 13:52:15,982 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 22928768 bytes 2010-01-20 13:52:15,988 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 22985616 bytes. Priority Sizes: Single=84.519615MB (88625240), Multi=99.30783MB (104131808),Memory=0.0MB (0) 2010-01-20 13:53:03,613 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=169.96802MB (178224384), Free=48.36949MB (50719088), Max=218.33751MB (228943472), Counts: Blocks=2683, Access=184190, Hit=10090, Miss=174100, Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.478039011359215%, Miss Ratio=94.521963596344%, Evicted/Run=349.0 2010-01-20 13:54:03,605 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=174.83826MB (183331208), Free=43.499245MB (45612264), Max=218.33751MB (228943472), Counts: Blocks=2762, Access=184477, Hit=10298, Miss=174179, Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.582267791032791%, Miss Ratio=94.41773295402527%, Evicted/Run=349.0 2010-01-20 13:55:45,744 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-209323108721490976_6065406 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.23.122.23:39203 remote=/172.23.122.23:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
       at java.io.DataInputStream.readFully(Unknown Source)
       at java.io.DataInputStream.readLong(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2399)

2010-01-20 13:55:45,744 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=178.23663MB (186894656), Free=40.100876MB (42048816), Max=218.33751MB (228943472), Counts: Blocks=2816, Access=184676, Hit=10443, Miss=174233, Evictions=7, Evicted=2443, Ratios: Hit Ratio=5.654768273234367%, Miss Ratio=94.34522986412048%, Evicted/Run=349.0 2010-01-20 13:55:45,745 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@2f242b11
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) 2010-01-20 13:55:45,746 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 66554ms, ten times longer than scheduled: 3000 2010-01-20 13:55:45,911 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null 2010-01-20 13:55:46,178 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
       at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
       at java.lang.Thread.run(Unknown Source)
2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-209323108721490976_6065406 bad datanode[0] 172.23.122.23:50010 2010-01-20 13:55:46,359 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-209323108721490976_6065406 in pipeline 172.23.122.23:50010, 172.23.122.16:50010,172.23.122.22:50010: bad datanode 172.23.122.23:50010 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 3 on 60020 took 71023ms appending an edit to hlog; editcount=41251 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: regionserver/172.23.122.23:60020.logFlusher took 67200ms optional sync'ing hlog; editcount=41271 2010-01-20 13:55:46,623 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 1 on 60020 took 69781ms appending an edit to hlog; editcount=41271 2010-01-20 13:55:47,081 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server c1.machines.nl/172.23.122.1:2181 2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/172.23.122.23:53151 remote=c1/172.23.122.1:2181] 2010-01-20 13:55:47,082 INFO org.apache.zookeeper.ClientCnxn: Server connection successful 2010-01-20 13:55:47,083 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x1264b1fe3240006 to sun.nio.ch.selectionkeyi...@5f84f3d2
java.io.IOException: Session Expired
at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) 2010-01-20 13:55:47,086 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null 2010-01-20 13:55:47,086 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired

Reply via email to