Hi All:

After chatting with jdcryans in the IRC channel, this is what we found out.
Some of our tables have 40 to 50 column families. HBase 0.19 opens 2 to 3X as many files as 0.20, so even though we had raised the open-file limit (ulimit -n), we were getting "Too many open files" exceptions, and the client stops because of them. When a client stops mid-request, the region servers log ClosedChannelException. HBase is better off with fewer column families; the suggestion was to simulate families by prefixing column names instead (a write-side sketch of that idea follows this summary, and a read-side sketch follows the chat log). I have included the IRC chat below for your reference.

Regards,
-- Siva Jagadeesan
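To make the "prefix columns instead of column families" suggestion concrete, here is a minimal write-side sketch, assuming the 0.19-era client API we are running (HTable/BatchUpdate). The table name "mytable", the single family "data", and the qualifiers below are hypothetical; the idea is that each former family collapses into a qualifier prefix under one real family:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class PrefixedColumnsWrite {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    // "mytable" is a hypothetical table with a single family "data:".
    HTable table = new HTable(conf, "mytable");

    // Instead of one family per logical group (addr:street, phone:home, ...),
    // fold the old family name into the qualifier:
    // data:addr_street, data:addr_city, data:phone_home, ...
    BatchUpdate update = new BatchUpdate("row-1");
    update.put("data:addr_street", "123 Main St".getBytes());
    update.put("data:addr_city", "San Francisco".getBytes());
    update.put("data:phone_home", "555-1234".getBytes());
    table.commit(update);
  }
}

Grouping still works because columns sort by full name, so all "data:addr_*" qualifiers sit next to each other; the trade-off is that per-group settings such as max versions and compression are per-family, so they can no longer differ between groups.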
[11:13am] sivajag_: Does anyone know about this error
[11:13am] sivajag_: 2009-08-19 02:40:27,142 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call getRow([...@b73d39, [...@1471c1, null, 9223372036854775807, 10000, -1) from 10.254.154.95:48659: output error
[11:13am] sivajag_: 2009-08-19 02:40:27,142 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020 caught: java.nio.channels.ClosedChannelException
[11:13am] sivajag_: at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
[11:13am] sivajag_: at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
[11:13am] sivajag_: at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1119)
[11:13am] sivajag_: at org.apache.hadoop.hbase.ipc.HBaseServer.access$2000(HBaseServer.java:70)
[11:13am] sivajag_: at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:610)
[11:13am] sivajag_: at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:674)
[11:13am] sivajag_: at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:940)
[11:14am] sivajag_: I am using HBase 0.19.3 cluster (1 master and 2 region nodes)
[11:16am] jdcryans: sivajag_: you are getting this in a region server?
[11:17am] sivajag_: yes
[11:18am] jdcryans: seems that a client was doing a getRow and was somehow killed/stopped
[11:19am] sivajag_: mmm
[11:20am] sivajag_: It is a long running process that does getRow
[11:20am] sivajag_: it ran fine for an hour .. then I started getting these errors
[11:21am] jdcryans: what do you get on the client side?
[11:23am] sivajag_: 2009-08-19 08:29:11: java.lang.reflect.UndeclaredThrowableException
[11:23am] sivajag_: $Proxy0.findRootRegion(Unknown Source)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:770)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:465)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:440)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:515)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:474)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:440)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:515)
[11:23am] sivajag_: org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:478)
[11:25am] jdcryans: so you were only doing reads in hbase and all of a sudden that happened?
[11:26am] sivajag_: we were doing reads from one table and writes to another table
[11:27am] jdcryans: since you only have 3 nodes, could you pastebin the logs of the master and region servers? like 200-300 lines around when that happened
[11:28am] RaySl joined the chat room.
[11:28am] sivajag_: ok
[11:28am] sivajag_: give me
[11:29am] sivajag_: a minute
[11:30am] jdcryans: np
[11:34am] sivajag_: http://pastie.org/588828
[11:34am] sivajag_: i have pasted log info at http://pastie.org/588828
[11:38am] rathore joined the chat room.
[11:38am] jdcryans: sivajag_: there is nothing in the master log around 02:37:58?
[11:39am] sivajag_: let me check
[11:39am] sivajag_: 2009-08-19 02:36:33,140 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {regionname: -ROOT-,,0, startKey: <>, server: 10.208.30.79:60020} complete
[11:39am] sivajag_: 2009-08-19 02:36:42,041 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {regionname: .META.,,1, startKey: <>, server: 10.208.30.79:60020}
[11:39am] sivajag_: 2009-08-19 02:36:42,130 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 22 row(s) of meta region {regionname: .META.,,1, startKey: <>, server: 10.208.30.79:60020} complete
[11:39am] sivajag_: 2009-08-19 02:36:42,130 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
[11:39am] sivajag_: 2009-08-19 02:37:33,115 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {regionname: -ROOT-,,0, startKey: <>, server: 10.208.30.79:60020}
[11:39am] sivajag_: 2009-08-19 02:37:35,916 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {regionname: -ROOT-,,0, startKey: <>, server: 10.208.30.79:60020} complete
[11:39am] sivajag_: 2009-08-19 02:37:42,042 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {regionname: .META.,,1, startKey: <>, server: 10.208.30.79:60020}
[11:39am] sivajag_: 2009-08-19 02:37:43,803 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 22 row(s) of meta region {regionname: .META.,,1, startKey: <>, server: 10.208.30.79:60020} complete
[11:39am] sivajag_: 2009-08-19 02:37:43,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
[11:39am] sivajag_: 2009-08-19 02:38:33,116 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {regionname: -ROOT-,,0, startKey: <>, server: 10.208.30.79:60020}
[11:39am] sivajag_: 2009-08-19 02:38:33,478 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {regionname: -ROOT-,,0, startKey: <>, server: 10.208.30.79:60020} complete
[11:39am] sivajag_: 2009-08-19 02:38:42,043 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {regionname: .META.,,1, startKey: <>, server: 10.208.30.79:60020}
[11:39am] sivajag_: 2009-08-19 02:38:42,109 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 22 row(s) of meta region {regionname: .META.,,1, startKey: <>, server: 10.208.30.79:60020} complete
[11:39am] sivajag_: 2009-08-19 02:38:42,109 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
[11:39am] sivajag_: 2009-08-19 02:39:33,117 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {regionname: -ROOT-,,0, startKey: <>, server: 10.208.30.79:60020}
[11:39am] You left the chat by being disconnected from the server.
[11:40am] You rejoined the room.
[11:42am] jdcryans: sivajag_: it's better to pastebin every log
[11:42am] jdcryans: or you get easily kicked out
[11:42am] jdcryans: so is there any exception before that?
[11:43am] sivajag_: hehe
[11:43am] sivajag_: I will pastebin
[11:43am] sivajag_: there is an exception in master
[11:43am] sivajag_: let me pastebin
[11:46am] sivajag_: pasted in the same place
[11:46am] sivajag_: http://pastie.org/588828
[11:48am] jdcryans: and there is nothing else before that seems wrong?
[11:48am] sivajag_: no
[11:49am] jdcryans: that's weird... haven't seen this type of error much...
[11:49am] sivajag_: the problem I am having is I am not sure whether it is my hbase setup or our code doing something funky
[11:49am] sivajag_: too many moving wheels
[11:50am] jdcryans: well I've always seen these errors when clients were getting killed, almost never in another context
[11:51am] nitay joined the chat room.
[11:52am] MrBradford: wow, you guys are right. IntelliJ is niiiiice
[11:53am] jdcryans: sivajag_: anything else on the client side apart from the java.lang.reflect.UndeclaredThrowableException?
[11:53am] sivajag_: let me check
[11:57am] sivajag_: jdcryans: another one
[11:57am] sivajag_: http://pastie.org/588828
[11:58am] jdcryans: sivajag_: exception: java.io.IOException: Too many open files
[12:00pm] jdcryans: is this a 2.6.27 kernel?
[12:00pm] jdcryans: and did you change the nofile limit?
[12:01pm] sivajag_: It is Fedora release 8 (Werewolf)
[12:01pm] sivajag_: Kernel \r on an \m
[12:01pm] sivajag_: i am not sure I did change the nofile limit
[12:01pm] sivajag_: how do I check
[12:02pm] sivajag_: what it is set currently
[12:02pm] jgray: ulimit -n
[12:02pm] sivajag_: ulimit -n
[12:02pm] sivajag_: 32768
[12:03pm] sivajag_: in all my three boxes
[12:04pm] MrBradford: I think that's correct.
[12:05pm] jdcryans: sivajag_: and uname -r?
[12:05pm] sivajag_: uname -r
[12:05pm] sivajag_: 2.6.21.7-2.fc8xen
[12:05pm] jdcryans: mm
[12:06pm] jdcryans: I would be surprised that you hit your nofile limit and the epoll limit wasn't in 2.6.21...
[12:06pm] jdcryans: do you have tons of families per table?
[12:07pm] sivajag_: 40 to 50 families per table
[12:07pm] sivajag_: some have only like 10
[12:07pm] jgray: that's a lot
[12:07pm] jdcryans: yep, that could be the problem
[12:08pm] jdcryans: why do you need so many families?
[12:08pm] MrBradford: Esp in .19, I'd imagine
[12:08pm] jgray: 0.19 has 2-3X the number of open files as 0.20
[12:09pm] sivajag_: we need that many families for our business solution
[12:09pm] jgray: are you sure?
[12:10pm] jgray: it's a bit high... more than 10-20 is really pushing it, even those are high
[12:10pm] jgray: each family is basically another table, but grouped by row key
[12:10pm] rathore: we're using column names as data... and flattening 2-level object graphs into single hbase rows
[12:11pm] jgray: could u prefix your column names
[12:11pm] rathore: which is why a couple of tables have as many as 30-40 column families
[12:11pm] jgray: to get grouping
[12:12pm] rathore: so ur saying try to simulate "column-families" with a column name prefix
[12:12pm] jgray: perhaps
[12:13pm] rathore: hmm
[12:13pm] rathore: so hbase cant really have too many column families
[12:13pm] rathore: i was under the impression that i could have thousands
[12:13pm] jgray: there's nothing saying u can't have 50 families, but it's beyond what we've targeted
[12:13pm] rathore: oh i see
[12:13pm] jdcryans: the practical limit would be 100
[12:13pm] rathore: ok
[12:13pm] jgray: it's expensive though, in every way
[12:13pm] rathore: and thousands of columns (instead of families?)
[12:14pm] jdcryans: but you would need hbase-0.20 for 100 families
[12:14pm] jgray: yes, thousands of columns
[12:14pm] rathore: got it
[12:14pm] jdcryans: columns you can have millions
[12:14pm] jgray: 0.19 supported 10s of thousands in a family
[12:14pm] jgray: 0.20 supports millions
[12:14pm] rathore: so that's probably what our problem is
[12:14pm] rathore: ok
[12:14pm] jdcryans: btw you can check the number of open files with lsof | wc -l
[12:15pm] rathore: yes
[12:15pm] rathore: we're definitely seeing lots of open file handles
[12:16pm] rathore: thanks guys
[12:17pm] rathore: this was stupid design on our part then
[12:17pm] jdcryans: better to learn it now

On Wed, Aug 19, 2009 at 11:08 AM, Siva Jagadeesan <[email protected]> wrote:
> Hi All:
> I am using Hbase 0.19.3 in cluster (1 master and 2 region nodes)
>
> I keep getting these errors
>
> 2009-08-19 02:40:27,142 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call getRow([...@b73d39, [...@1471c1, null, 9223372036854775807, 10000, -1) from 10.254.154.95:48659: output error
> 2009-08-19 02:40:27,142 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020 caught: java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1119)
> at org.apache.hadoop.hbase.ipc.HBaseServer.access$2000(HBaseServer.java:70)
> at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:610)
> at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:674)
> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:940)
>
> Any idea?
>
> -- Siva Jagadeesan
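PS: This is the read side of jgray's "prefix your column names to get grouping" suggestion from the chat above, under the same assumptions as the write sketch earlier (0.19-era client API, hypothetical table "mytable" and prefix "data:addr_"). getRow returns every column of the row keyed by full column name in sorted order, so a simulated "family" is recovered by selecting qualifiers that share a prefix:

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;

public class PrefixedColumnsRead {
  public static void main(String[] args) throws IOException {
    // Same hypothetical table as in the write sketch above.
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // RowResult is a SortedMap<byte[], Cell> keyed by the full column name
    // ("data:addr_street", "data:phone_home", ...), so qualifiers that share
    // a prefix come back adjacent to each other.
    RowResult row = table.getRow("row-1");
    for (Map.Entry<byte[], Cell> entry : row.entrySet()) {
      String column = new String(entry.getKey());
      if (column.startsWith("data:addr_")) {  // pick out one simulated "family"
        System.out.println(column + " = " + new String(entry.getValue().getValue()));
      }
    }
  }
}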
