Thanks also for the considered message below, J-A. I echo J-D's suggestion that the #3 note belongs on the troubleshooting page. Your suggestion would help with https://issues.apache.org/jira/browse/HBASE-616, IIUC? If you agree, I can add it, no problem. Will follow up on the other items in a while. St.Ack
On Mon, Feb 23, 2009 at 1:56 AM, Jean-Adrien <[email protected]> wrote:
>
> Hello,
>
> I recently upgraded to HBase / Hadoop 0.19.0 and I have some notes to
> share:
>
> == 1. Xcievers. ==================
>
> First of all, thank you for this new version.
> Regarding the Hadoop settings for Xceiver threads, as noticed by stack
> (http://www.nabble.com/Datanode-Xceivers-td21372227.html#a21567289),
> the xcievers / dfs.datanode.socket.write.timeout settings can be enabled
> without the problems noticed with 0.18.0.
>
> == 2. Stopping HBase ==============
> I noticed a new problem:
> If, for any reason, one of my datanodes becomes unavailable for a while,
> my dfs ends up in an unstable state and I have to restart the cluster.
> But when I want to stop HBase, one of the region servers never stops,
> because it keeps retrying to reach an 'unavailable' block:
>
> ## region server log
> 2009-02-23 09:50:09,864 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 failed because recovery
> from primary datanode 192.168.1.10:50010 failed 6 times. Marking primary
> datanode as bad.
> 2009-02-23 09:50:10,864 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 bad datanode[1]
> 192.168.1.10:50010
> 2009-02-23 09:50:10,864 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 in pipeline
> 192.168.1.13:50010, 192.168.1.10:50010: bad datanode 192.168.1.10:50010
> 2009-02-23 09:50:10,869 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 failed because recovery
> from primary datanode 192.168.1.13:50010 failed 1 times. Will retry...
> 2009-02-23 09:50:11,869 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 bad datanode[1]
> 192.168.1.10:50010
> 2009-02-23 09:50:11,869 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 in pipeline
> 192.168.1.13:50010, 192.168.1.10:50010: bad datanode 192.168.1.10:50010
> 2009-02-23 09:50:11,875 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 failed because recovery
> from primary datanode 192.168.1.13:50010 failed 2 times. Will retry...
>
> This is logged in an endless loop.
> The datanode says the block is 'already commited'? The corresponding error
> in the datanode log:
>
> ### datanode log
> 2009-02-23 09:50:43,532 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_7545556036225037274_1820952,
> targets=[192.168.1.10:50010])
> 2009-02-23 09:50:43,534 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 5 on 50020, call recoverBlock(blk_7545556036225037274_1820952,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@1879f77) from
> 192.168.1.13:48587: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: blk_7545556036225037274_1820952 is already commited,
> storedBlock == null.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4536)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
>         at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_7545556036225037274_1820952 is already commited, storedBlock == null.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4536)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
>         at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:696)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>         at $Proxy4.nextGenerationStamp(Unknown Source)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1466)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1440)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1506)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
> A kill (-TERM) on the region server launches the SIGTERM handler thread,
> but the loop that tries to reach the block does not stop.
> here is a thread dump of the region server at this time: > > ### thread dump of the region server > Full thread dump Java HotSpot(TM) Server VM (10.0-b23 mixed mode): > > "Thread-16" prio=10 tid=0x081c3c00 nid=0x6e81 in Object.wait() > [0x8f74f000..0x8f74fec0] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1143) > - locked <0x93d0d060> (a java.lang.Thread) > at java.lang.Thread.join(Thread.java:1196) > at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:78) > at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:66) > at > > org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread.run(HRegionServer.java:814) > > "SIGTERM handler" daemon prio=10 tid=0x08dbcc00 nid=0x6e80 in Object.wait() > [0x8edaf000..0x8edaff40] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1143) > - locked <0x93d0f3c0> (a > org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread) > at java.lang.Thread.join(Thread.java:1196) > at > java.lang.ApplicationShutdownHooks.run(ApplicationShutdownHooks.java:79) > at java.lang.Shutdown.runHooks(Shutdown.java:89) > at java.lang.Shutdown.sequence(Shutdown.java:133) > at java.lang.Shutdown.exit(Shutdown.java:178) > - locked <0xb0d3ed60> (a java.lang.Class for java.lang.Shutdown) > at java.lang.Terminator$1.handle(Terminator.java:35) > at sun.misc.Signal$1.run(Signal.java:195) > at java.lang.Thread.run(Thread.java:619) > > "Attach Listener" daemon prio=10 tid=0x081e8800 nid=0x6e50 waiting on > condition [0x00000000..0x00000000] > java.lang.Thread.State: RUNNABLE > > "IPC Client (47) connection to /192.168.1.13:50020 from an unknown user" > daemon prio=10 tid=0x8e97a400 nid=0x39d1 in Object.wait() > [0x8ec81000..0x8ec81fc0] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:395) > - locked <0xa44a36a0> (a org.apache.hadoop.ipc.Client$Connection) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:437) > > "IPC Client (47) connection to /192.168.1.10:50020 from an unknown user" > daemon prio=10 tid=0x8e5f1c00 nid=0x523f in Object.wait() > [0x8ecd3000..0x8ecd3140] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:395) > - locked <0xa448ca50> (a org.apache.hadoop.ipc.Client$Connection) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:437) > > "BlockFSInputStreamReferenceQueueChecker" daemon prio=10 tid=0x085a9400 > nid=0x8c8 waiting on condition [0x8ee6c000..0x8ee6cdc0] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93e76068> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at java.util.concurrent.DelayQueue.take(DelayQueue.java:160) > at > > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:582) > at > > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:575) > at > > 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:946) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:906) > at java.lang.Thread.run(Thread.java:619) > > "IPC Server handler 9 on 60020" daemon prio=10 tid=0x0882e000 nid=0x889 > waiting on condition [0x8eebd000..0x8eebdfc0] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 8 on 60020" daemon prio=10 tid=0x0882cc00 nid=0x888 > waiting on condition [0x8ef0e000..0x8ef0f040] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 7 on 60020" daemon prio=10 tid=0x0882b800 nid=0x887 > waiting on condition [0x8ef5f000..0x8ef5fec0] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 6 on 60020" daemon prio=10 tid=0x0882a400 nid=0x886 > waiting on condition [0x8efb0000..0x8efb0f40] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 5 on 60020" daemon prio=10 tid=0x08829000 nid=0x885 > waiting on condition [0x8f001000..0x8f001dc0] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 4 on 60020" daemon prio=10 tid=0x083bcc00 nid=0x884 > waiting on condition [0x8f052000..0x8f052e40] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 3 on 60020" daemon prio=10 tid=0x083bb800 nid=0x883 > waiting on condition [0x8f0a3000..0x8f0a40c0] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 2 on 60020" daemon prio=10 tid=0x083ba400 nid=0x882 > waiting on condition [0x8f0f4000..0x8f0f5140] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 1 on 60020" daemon prio=10 tid=0x083b9400 nid=0x881 > waiting on condition [0x8f145000..0x8f145fc0] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server handler 0 on 60020" daemon prio=10 tid=0x083b8400 nid=0x880 > waiting on condition [0x8f196000..0x8f197040] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e9c8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881) > > "IPC Server listener on 60020" daemon prio=10 tid=0x083bfc00 nid=0x87f > 
runnable [0x8f1e7000..0x8f1e7ec0] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) > - locked <0x93d0c090> (a sun.nio.ch.Util$1) > - locked <0x93d0c120> (a java.util.Collections$UnmodifiableSet) > - locked <0x93d0c0b0> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:299) > > "IPC Server Responder" daemon prio=10 tid=0x083bec00 nid=0x87e runnable > [0x8f238000..0x8f238f40] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) > - locked <0x93d0c0f0> (a sun.nio.ch.Util$1) > - locked <0x93d0c100> (a java.util.Collections$UnmodifiableSet) > - locked <0x93d0be30> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.run(HBaseServer.java:458) > > "SocketListener0-1" prio=10 tid=0x088d5000 nid=0x87c in Object.wait() > [0x8f2da000..0x8f2dae40] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522) > - locked <0x93e39bc8> (a org.mortbay.util.ThreadPool$PoolThread) > > "SocketListener0-0" prio=10 tid=0x08ab5000 nid=0x87b in Object.wait() > [0x8f32b000..0x8f32c0c0] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522) > - locked <0x93e39920> (a org.mortbay.util.ThreadPool$PoolThread) > > "Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]" > prio=10 > tid=0x084f6800 nid=0x87a runnable [0x8f37c000..0x8f37d140] > java.lang.Thread.State: RUNNABLE > at java.net.PlainSocketImpl.socketAccept(Native Method) > at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384) > - locked <0x93e382a8> (a java.net.SocksSocketImpl) > at java.net.ServerSocket.implAccept(ServerSocket.java:453) > at java.net.ServerSocket.accept(ServerSocket.java:421) > at > org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432) > at > org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631) > > "SessionScavenger" daemon prio=10 tid=0x088f1400 nid=0x879 waiting on > condition [0x8f3cd000..0x8f3cdfc0] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > > org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587) > > "SessionScavenger" daemon prio=10 tid=0x08ab5c00 nid=0x878 waiting on > condition [0x8f41e000..0x8f41f040] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > > org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587) > > "regionserver/0.0.0.0:60020.leaseChecker" prio=10 tid=0x086d5800 nid=0x877 > waiting on condition [0x8f476000..0x8f476ec0] > java.lang.Thread.State: TIMED_WAITING (parking) > at 
sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x93d0e980> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) > at java.util.concurrent.DelayQueue.poll(DelayQueue.java:201) > at org.apache.hadoop.hbase.Leases.run(Leases.java:78) > > "regionserver/0.0.0.0:60020.majorCompactionChecker" daemon prio=10 > tid=0x086d4000 nid=0x876 waiting on condition [0x8f4c7000..0x8f4c7f40] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74) > at org.apache.hadoop.hbase.Chore.run(Chore.java:72) > > "LeaseChecker" daemon prio=10 tid=0x0837cc00 nid=0x870 waiting on condition > [0x8f6ad000..0x8f6ae040] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:979) > at java.lang.Thread.run(Thread.java:619) > > "DataStreamer for file > /hbase/log_192.168.1.13_1235129194745_60020/hlog.dat.1235129195649 block > blk_7545556036225037274_1820952" daemon prio=10 tid=0x08375400 nid=0x86f in > Object.wait() [0x8f6fe000..0x8f6feec0] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2166) > - locked <0x93d0ea10> (a java.util.LinkedList) > > "DestroyJavaVM" prio=10 tid=0x0805a000 nid=0x855 waiting on condition > [0x00000000..0xb7dd2090] > java.lang.Thread.State: RUNNABLE > > "regionserver/0.0.0.0:60020" prio=10 tid=0x085b2400 nid=0x863 in > Object.wait() [0x8f7f7000..0x8f7f7f40] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at > > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3015) > - locked <0x93d0ea10> (a java.util.LinkedList) > - locked <0x93d0ec58> (a > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream) > at > > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3104) > - locked <0x93d0ec58> (a > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream) > at > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) > at > > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59) > at > org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79) > at > org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:959) > - locked <0x93d50d20> (a org.apache.hadoop.io.SequenceFile$Writer) > at org.apache.hadoop.hbase.regionserver.HLog.close(HLog.java:421) > - locked <0x93d4ce00> (a java.lang.Integer) > at > org.apache.hadoop.hbase.regionserver.HLog.closeAndDelete(HLog.java:404) > at > > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:373) > at java.lang.Thread.run(Thread.java:619) > > "Low Memory Detector" daemon prio=10 tid=0x080dd800 nid=0x85e runnable > [0x00000000..0x00000000] > java.lang.Thread.State: RUNNABLE > > "CompilerThread1" daemon prio=10 tid=0x080dc400 nid=0x85d waiting on > condition [0x00000000..0x8fbc1588] > java.lang.Thread.State: RUNNABLE > > "CompilerThread0" daemon prio=10 tid=0x080d9c00 nid=0x85c waiting on > condition [0x00000000..0x8fc42608] > java.lang.Thread.State: 
RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x080d8800 nid=0x85b runnable
> [0x00000000..0x8fc93e80]
>    java.lang.Thread.State: RUNNABLE
>
> "Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x080d7800 nid=0x85a
> waiting on condition [0x00000000..0x8fce522c]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x080bbc00 nid=0x859 in Object.wait()
> [0x8fd7b000..0x8fd7bec0]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>         - locked <0x93d0bdc0> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x080ba800 nid=0x858 in Object.wait()
> [0x8fdcc000..0x8fdccf40]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>         - locked <0x93d0c110> (a java.lang.ref.Reference$Lock)
>
> "VM Thread" prio=10 tid=0x080b7800 nid=0x857 runnable
>
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x0806c800 nid=0x856 runnable
> "VM Periodic Task Thread" prio=10 tid=0x080df000 nid=0x85f waiting on
> condition
>
> JNI global references: 922
>
> This forces me to kill (-KILL) the region server and restart the hdfs (if
> I don't restart hdfs, some regions are not available because the block is
> still not available).
>
> Note that the HBase master does not end either, and I have to kill (-KILL)
> it as well. Here is the thread dump of the HMaster process:
>
> ### Master thread dump:
>
> Full thread dump Java HotSpot(TM) Server VM (10.0-b23 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0805a000 nid=0x2326 waiting on
> condition [0x00000000..0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "SIGTERM handler" daemon prio=10 tid=0x089f0c00 nid=0x2303 waiting for
> monitor entry [0x8a6a5000..0x8a6a5ec0]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.lang.Shutdown.exit(Shutdown.java:178)
>         - waiting to lock <0xb0beed60> (a java.lang.Class for java.lang.Shutdown)
>         at java.lang.Terminator$1.handle(Terminator.java:35)
>         at sun.misc.Signal$1.run(Signal.java:195)
>         at java.lang.Thread.run(Thread.java:619)
>
> "Thread-1" prio=10 tid=0x8f666000 nid=0x653c waiting on condition
> [0x8f1fe000..0x8f1fef40]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client.stop(Client.java:667)
>         at org.apache.hadoop.ipc.RPC$ClientCache.stopClient(RPC.java:189)
>         at org.apache.hadoop.ipc.RPC$ClientCache.access$400(RPC.java:138)
>         at org.apache.hadoop.ipc.RPC$Invoker.close(RPC.java:229)
>         - locked <0x938a1b18> (a org.apache.hadoop.ipc.RPC$Invoker)
>         at org.apache.hadoop.ipc.RPC$Invoker.access$500(RPC.java:196)
>         at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:353)
>         at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:213)
>         - locked <0x93878470> (a org.apache.hadoop.hdfs.DFSClient)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
>         at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
>         - locked <0x937d3c00> (a org.apache.hadoop.fs.FileSystem$Cache)
>         at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
>         at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)
>         - locked <0x937b8a78> (a org.apache.hadoop.fs.FileSystem$ClientFinalizer)
>
> "DestroyJavaVM" prio=10 tid=0x8f66c800 nid=0x36e9 in Object.wait()
> [0xb7d7d000..0xb7d7e0e0]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1143)
>         - locked <0x937b8a78> (a org.apache.hadoop.fs.FileSystem$ClientFinalizer)
>         at java.lang.Thread.join(Thread.java:1196)
>         at java.lang.ApplicationShutdownHooks.run(ApplicationShutdownHooks.java:79)
>         at java.lang.Shutdown.runHooks(Shutdown.java:89)
>         at java.lang.Shutdown.sequence(Shutdown.java:133)
>         at java.lang.Shutdown.shutdown(Shutdown.java:200)
>         - locked <0xb0beed60> (a java.lang.Class for java.lang.Shutdown)
>
> "Low Memory Detector" daemon prio=10 tid=0x8fe03800 nid=0x36f4 runnable
> [0x00000000..0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x8fe01c00 nid=0x36f3 waiting on
> condition [0x00000000..0x8f9ee488]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x8fe00800 nid=0x36f2 waiting on
> condition [0x00000000..0x8fa6f508]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x0813e800 nid=0x36f1 runnable
> [0x00000000..0x8fac0d80]
>    java.lang.Thread.State: RUNNABLE
>
> "Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x0813d800 nid=0x36f0
> waiting on condition [0x00000000..0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x08125000 nid=0x36ef in Object.wait()
> [0x8fba1000..0x8fba1dc0]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>         - locked <0x937bafd8> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x08120c00 nid=0x36ee in
> Object.wait() [0x8fbf2000..0x8fbf2e40]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>         - locked <0x937baf18> (a java.lang.ref.Reference$Lock)
>
> "VM Thread" prio=10 tid=0x0811d800 nid=0x36ed runnable
>
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x0805e000 nid=0x36ea
> runnable
>
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x0805f000 nid=0x36eb
> runnable
>
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x080cd000 nid=0x36ec runnable
> "VM Periodic Task Thread" prio=10 tid=0x8fe05000 nid=0x36f5 waiting on
> condition
>
> JNI global references: 1044
>
> When I restart dfs, the log shows me that the block in question is marked
> as invalid, and it is therefore deleted:
>
> #### Namenode log
> 2009-02-23 10:33:49,245 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.processReport: block blk_7545556036225037274_1820952 on
> 192.168.1.10:50010 size 27895296 does not belong to any file.
> 2009-02-23 10:33:49,245 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_7545556036225037274 is added to invalidSet
> of 192.168.1.10:50010
>
> Finally, when I restart HBase, I see messages like the following in the
> region server log, and I have lost some data.
>
> #### Region server log:
> 2009-02-23 10:38:55,831 WARN org.apache.hadoop.hbase.regionserver.HStore:
> Exception processing reconstruction log
> hdfs://lab5-2:9000/hbase/proxy-D-0.3/510178748/oldlogfile.log opening
> [...@ff65bf -- continuing. Probably lack-of-HADOOP-1700 causing DATA LOSS!
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at org.apache.hadoop.hbase.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
>         at org.apache.hadoop.hbase.io.DataOutputBuffer.write(DataOutputBuffer.java:116)
>         at org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1944)
>         at org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1844)
>         at org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1890)
>         at org.apache.hadoop.hbase.regionserver.HStore.doReconstructionLog(HStore.java:352)
>         at org.apache.hadoop.hbase.regionserver.HStore.runReconstructionLog(HStore.java:297)
>         at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:237)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1764)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:276)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1367)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1338)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1253)
>         at java.lang.Thread.run(Thread.java:619)
>
> I have no idea exactly what happened to cause this, but it is certainly
> because my datanode became unresponsive for a while.
>
>
> ==== 3. RS leases / JVM settings (troubleshooting section 7) =====
>
> A useful note for people who suffer from swapping and have region servers
> that shut themselves down when they miss a lease to contact the master. In
> my case the delay can be up to 15 min when my system becomes overloaded by
> IO during a heavy swapping task. Even though the swap slows down the whole
> system, that alone is no reason for the timers not to fire as soon as they
> get a bit of CPU; the real reason is that the swapping happens at garbage
> collection time in the region server JVM.
> The 'default' garbage collector of the JVM fills its entire space until it
> starts a 'full' GC of the 'old space' (see the article linked below for
> details), and during that GC it blocks all threads, timers included.
> In my case much of this memory space sits in the swap partition, so the GC
> leads to intensive swapping and takes 10-15 minutes, during which the
> region server is 'paused' and the leases are missed. The problem is
> obviously that my weak hardware does not meet the minimal requirements for
> HBase; nevertheless, one can enable another GC implementation in the
> HotSpot JVM. For example, using the 'concurrent' garbage collector
> (-XX:+UseConcMarkSweepGC) helps a lot, since it does not stop the threads
> while it runs. Unfortunately, even with this setting a 'full' GC still
> happens sometimes if memory becomes too fragmented, which again pauses the
> JVM, swaps, and eventually makes the RS miss its lease. Anyway, it happens
> less frequently.
>
> See http://java.sun.com/docs/hotspot/gc1.4.2/faq.html for more info about
> JVM GC, and how to minimize 'full' GC events.
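For anyone who wants to try the GC change described in #3, here is a minimal sketch of what it could look like in conf/hbase-env.sh. The flags are standard HotSpot options of that era; the logging flags are optional, and nothing here is a tuned recommendation:

# Use the concurrent mark-sweep collector so that most collection work runs
# alongside the application threads instead of stopping them. As noted above,
# full GCs can still occur, just less often.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"

# Optional: log GC activity so you can see how long the pauses really are.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

The underlying issue, as described above, is a heap that spills into swap; any collector will stall badly once its pages have been swapped out, so sizing the heap to stay in physical RAM matters more than the choice of collector.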
>
> Have a nice day
> -- Jean-Adrien
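For completeness on item #1, the two Hadoop properties being discussed typically end up in the datanodes' hadoop-site.xml along these lines. This is a sketch only: the xcievers value is illustrative, the property name keeps Hadoop's historical misspelling, and per the note above the write-timeout workaround should no longer be needed on 0.19.0:

<property>
  <!-- Upper bound on concurrent DataXceiver threads per datanode; HBase
       tends to need more than the small default. Value is illustrative. -->
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>

<property>
  <!-- 0 disables the datanode write timeout; this was the common 0.18.0
       workaround. On 0.19.0, per the note above, the default timeout can
       be left in place instead. -->
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>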
