It sounds like all the RS handlers were full and we're missing that thread dump. Next time please jstack everything, alright? ;)
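For next time, here's a rough sketch of one way to grab dumps from every HBase daemon on the box in one go. This is just an illustration, not anything we ship: it assumes the JDK's jps and jstack are on the PATH, and the output directory is arbitrary.

{code}
#!/usr/bin/env python
# Illustrative only: dump stacks for every HBase daemon that jps can see.
# Assumes the JDK's jps/jstack are on the PATH; the output dir is arbitrary.
import subprocess

def dump_hbase_stacks(out_dir="/tmp"):
    # jps prints one "<pid> <main class>" pair per line
    for line in subprocess.check_output(["jps"]).decode().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] in ("HMaster", "HRegionServer"):
            pid, name = parts
            with open("%s/jstack-%s-%s.txt" % (out_dir, name, pid), "w") as out:
                subprocess.call(["jstack", pid], stdout=out)

if __name__ == "__main__":
    dump_hbase_stacks()
{code}

Running something like that against both the master and RS JVMs while the create-table call is stuck should show whether the RS handlers really are all blocked.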
J-D

On Wed, Apr 3, 2013 at 11:31 PM, ramkrishna vasudevan <[email protected]> wrote:

> Hi All
>
> During some random testing I faced this issue in the trunk version.
>
> Master and RS are on the same node and I have only one node.
>
> I created some tables, disabled and dropped them.
>
> The META was intact.
>
> Again when I tried to create the table, META could not be contacted,
> even though it is on the same machine.
>
> On restarting the master things became fine again. I am not able to
> reproduce this issue, but for some time nothing actually worked because
> the META RS became unreachable.
>
> {logs}
> 2013-04-03 17:21:21,675 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Deleting regions from META
> 2013-04-03 17:21:21,679 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted from META, regions: [{NAME => 'TestTable,,1365024046851.e5e94a1c4adc038e6517e45bcbbab9a9.', STARTKEY => '', ENDKEY => '', ENCODED => e5e94a1c4adc038e6517e45bcbbab9a9,}]
> 2013-04-03 17:21:21,683 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Archiving region TestTable,,1365024046851.e5e94a1c4adc038e6517e45bcbbab9a9. from FS
> 2013-04-03 17:21:21,683 DEBUG org.apache.hadoop.hbase.backup.HFileArchiver: ARCHIVING region hdfs://localhost:9010/hbase/.tmp/TestTable/e5e94a1c4adc038e6517e45bcbbab9a9
> ------------
> (This is where the table got deleted)
> 2013-04-03 17:21:21,705 DEBUG org.apache.hadoop.hbase.backup.HFileArchiver: Deleted all region files in: hdfs://localhost:9010/hbase/.tmp/TestTable/e5e94a1c4adc038e6517e45bcbbab9a9
> 2013-04-03 17:21:21,706 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Table 'TestTable' archived!
> 2013-04-03 17:21:21,706 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Removing 'TestTable' descriptor.
> 2013-04-03 17:21:21,708 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Marking 'TestTable' as deleted.
> 2013-04-03 17:21:21,709 DEBUG org.apache.hadoop.hbase.master.TableLockManager: Attempt to release table write lock on :TestTable
> 2013-04-03 17:21:21,712 DEBUG org.apache.hadoop.hbase.zookeeper.lock.ZKInterProcessLockBase: Successfully released /hbase/table-lock/TestTable/write-master:600000000000002
> 2013-04-03 17:21:21,712 DEBUG org.apache.hadoop.hbase.master.TableLockManager: Released table lock on :TestTable
> ------
> (Again started creating the table)
>
> 2013-04-03 17:25:23,955 DEBUG org.apache.hadoop.hbase.master.TableLockManager: Attempt to acquire table write lock on :TestTable for:C_M_CREATE_TABLE
> 2013-04-03 17:25:23,961 DEBUG org.apache.hadoop.hbase.zookeeper.lock.ZKInterProcessLockBase: Successfully acquired a lock for /hbase/table-lock/TestTable/write-master:600000000000000
> 2013-04-03 17:25:23,961 DEBUG org.apache.hadoop.hbase.master.TableLockManager: Acquired table write lock on :TestTable for:C_M_CREATE_TABLE
> 2013-04-03 17:25:23,963 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key 'TestTable,,'
> 2013-04-03 17:25:23,963 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Advancing internal scanner to startKey at 'TestTable,,'
> 2013-04-03 17:25:52,928 DEBUG org.apache.hadoop.hbase.util.FSUtils: hdfs://localhost:9010/hbase/.archive/TestTable/14ffdb436fae5eec5fc0ac0a8ca89c57/info/.links-86f33ddcb5db4f7ba91f543303491908 doesn't exist
> 2013-04-03 17:25:52,937 DEBUG org.apache.hadoop.hbase.util.FSUtils: hdfs://localhost:9010/hbase/.archive/TestTable/e5e94a1c4adc038e6517e45bcbbab9a9/info/.links-be7b133a187c42b29cdd83ee555c8905 doesn't exist
> 2013-04-03 17:25:52,940 DEBUG org.apache.hadoop.hbase.util.FSUtils: hdfs://localhost:9010/hbase/.archive/TestTable/e5e94a1c4adc038e6517e45bcbbab9a9/info/.links-f30d16210ff84957a8a6800538d6884e doesn't exist
> 2013-04-03 17:26:05,031 DEBUG org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Skipping load balance as cluster has only one node.
> 2013-04-03 17:26:05,032 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key ''
> 2013-04-03 17:26:05,032 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Advancing internal scanner to startKey at ''
>
> ..................
> 2013-04-03 17:41:05,032 DEBUG org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Skipping load balance as cluster has only one node.
> 2013-04-03 17:41:24,456 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=7, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:41:24,456 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=6, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:42:24,473 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=8, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:43:24,533 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=8, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:43:24,534 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=7, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:44:24,559 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=9, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:45:24,578 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=8, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:45:24,578 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=9, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:46:05,032 DEBUG org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Skipping load balance as cluster has only one node.
>
> (I got the above msg continuously)
> {logs}
>
> The thread dump in the master says
> {code}
> "IPC Server handler 9 on 60000" daemon prio=10 tid=0x00007fea8c51b000 nid=0x1ad8 in Object.wait() [0x00007fea7121b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1307)
>         - locked <0x000000077d3b6cb0> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
>         at org.apache.hadoop.hbase.ipc.ProtobufRpcClientEngine$Invoker.invoke(ProtobufRpcClientEngine.java:131)
>         at $Proxy15.scan(Unknown Source)
>         at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:268)
>         at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:137)
>         at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:52)
>         at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:167)
>         at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:216)
>         at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:131)
>         at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:531)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:536)
>         at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:309)
>         at org.apache.hadoop.hbase.master.handler.CreateTableHandler.prepare(CreateTableHandler.java:104)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1538)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1568)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> {code}
>
> The thread dump in the client says
> {code}
>         at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:268)
>         at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:137)
>         at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:52)
>         at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:167)
>         at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:216)
>         at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:131)
>         at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:531)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:536)
>         at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:309)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:248)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:262)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.checkTable(PerformanceEvaluation.java:513)
> {code}
>
> I did not take the RS dump (!!!).
>
> The exception log:
> {code}
> 2013-04-03 17:41:24,456 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=7, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:41:24,456 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=6, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> 2013-04-03 17:42:24,473 WARN org.apache.hadoop.hbase.client.ServerCallable: Received exception, tries=8, numRetries=100 message=Call to ram.sh.intel.com/10.239.47.144:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.144:59884 remote=ram.sh.intel.com/10.239.47.144:60020]
> {code}
>
> Does this mean that the RS process on the same node could not respond?
> Seems weird to me because both processes are on the same machine.
>
> Has anyone seen this problem? I will try to dig in more and update here if
> I reproduce the issue.
>
> Regards
> Ram
