I posted the master log up on a web server: http://assets0.pubget.com/data/hbase-pubget-master-carr.projectlounge.com.log.2009-12-02
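
For anyone who wants to skim a log like that for the first signs of trouble, here is a minimal sketch that counts suspicious lines per minute. It assumes each entry starts with a log4j-style prefix such as "2009-12-02 21:27:03,123 WARN ..."; that prefix format is an assumption and has not been checked against this particular file. The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed log4j-style prefix: "YYYY-MM-DD HH:MM:SS,mmm LEVEL ...".
# This is an assumption about the log format, not confirmed for this file.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} (\w+)")
PROBLEM_LEVELS = {"WARN", "ERROR", "FATAL"}


def problems_per_minute(lines):
    """Count timestamped lines per minute that are WARN/ERROR/FATAL
    or that mention an exception in the message."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            # Continuation lines (e.g. stack frames) carry no timestamp; skip them.
            continue
        minute, level = match.group(1), match.group(2)
        if level in PROBLEM_LEVELS or "Exception" in line:
            counts[minute] += 1
    return counts


def first_problem_minute(lines):
    """Return the earliest minute with a suspicious line, or None."""
    counts = problems_per_minute(lines)
    # ISO-style timestamps sort correctly as plain strings.
    return min(counts) if counts else None
```

To use it, open a local copy of the log (say a hypothetical `master.log`) and pass the file object to `first_problem_minute`, or print `sorted(problems_per_minute(f).items())` to see how the counts change over the evening.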

The crash happened around 22:00, though I can start to see a few
exceptions at 21:27, continuing on through the night. Again, the power
took out all of the nodes but left the master intact, so it looks like
most of the log is the master trying to regain communication with the
nodes. Would love to see any insight you have into the mystery.

Thanks,
Mike

On Thu, Dec 3, 2009 at 4:42 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Mike,
>
> I'm glad it worked out for you! And I'm curious too, this shouldn't be
> happening. I'd love to take a look at your master's log from the
> day of the failure. You could put it on a web server or try to attach
> it to a reply (but that usually gets filtered).
>
> J-D
>
> On Thu, Dec 3, 2009 at 1:23 PM, mike anderson <saidthero...@gmail.com> wrote:
> > wow! Thanks for all your help. I just took the add_table.rb script for a
> > run and it worked flawlessly. Kudos to the community!
> >
> > I'm still curious as to what might have happened? Was the .META. table
> > just slightly out of whack?
> >
> > -mike
> >
> > On Thu, Dec 3, 2009 at 3:36 PM, mike anderson <saidthero...@gmail.com> wrote:
> >
> >> This was a table that had been around for almost two months now and had
> >> many regions. The web UI reports 231 regions, and I am certain that the
> >> tables being reported don't have nearly that many regions, so perhaps
> >> this count includes those from the missing table.
> >>
> >> In the folder: /hbase/cached_web_pages/1102708773/http is a single 130MB
> >> file full of rows/columns.
> >> We are caching the full html of websites into the
> >> columns so copying and pasting some of the rows won't be very useful,
> >> but the chunk starts with this:
> >>
> >> "DATABLK*f #ŸRhttp%3A%2F%2Fwww.informaworld.com%2Fsmpp%2Ftitle%7Edb%3Dall%7Econtent%3Dg903750466
> >> httpdata $í ó "
> >>
> >> I tried to enable a region, but get:
> >>
> >> from (hbase):3
> >> hbase(main):003:0> enable_region 'cached_web_pages,metapress_ris_120417,1257429337740'
> >> NativeException: java.lang.NullPointerException: null
> >> from org/apache/hadoop/hbase/util/Writables.java:74:in `getWritable'
> >> from sun/reflect/NativeMethodAccessorImpl.java:-2:in `invoke0'
> >> from sun/reflect/NativeMethodAccessorImpl.java:39:in `invoke'
> >> from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
> >> from java/lang/reflect/Method.java:597:in `invoke'
> >> from org/jruby/javasupport/JavaMethod.java:298:in `invokeWithExceptionHandling'
> >> from org/jruby/javasupport/JavaMethod.java:278:in `invoke_static'
> >> from org/jruby/java/invokers/StaticMethodInvoker.java:57:in `call'
> >> from org/jruby/runtime/callsite/CachingCallSite.java:150:in `call'
> >> from org/jruby/ast/CallTwoArgNode.java:59:in `interpret'
> >> from org/jruby/ast/LocalAsgnNode.java:123:in `interpret'
> >> from org/jruby/ast/NewlineNode.java:104:in `interpret'
> >> from org/jruby/ast/BlockNode.java:71:in `interpret'
> >> from org/jruby/internal/runtime/methods/InterpretedMethod.java:201:in `call'
> >> from org/jruby/internal/runtime/methods/DefaultMethod.java:162:in `call'
> >> from org/jruby/runtime/callsite/CachingCallSite.java:150:in `call'
> >> ... 112 levels...
> >> from org/jruby/internal/runtime/methods/DynamicMethod.java:226:in `call'
> >> from org/jruby/internal/runtime/methods/CompiledMethod.java:211:in `call'
> >> from org/jruby/internal/runtime/methods/CompiledMethod.java:71:in `call'
> >> from org/jruby/runtime/callsite/CachingCallSite.java:253:in `cacheAndCall'
> >> from org/jruby/runtime/callsite/CachingCallSite.java:72:in `call'
> >> from usr/local/hbase/bin/$_dot_dot_/bin/hirb.rb:487:in `__file__'
> >> from usr/local/hbase/bin/$_dot_dot_/bin/hirb.rb:-1:in `load'
> >> from org/jruby/Ruby.java:577:in `runScript'
> >> from org/jruby/Ruby.java:480:in `runNormally'
> >> from org/jruby/Ruby.java:354:in `runFromMain'
> >> from org/jruby/Main.java:229:in `run'
> >> from org/jruby/Main.java:110:in `run'
> >> from org/jruby/Main.java:94:in `main'
> >> from /usr/local/hbase/bin/../bin/HBase.rb:138:in `enable_region'
> >> from /usr/local/hbase/bin/../bin/hirb.rb:350:in `enable_region'
> >> from (hbase):4
> >> hbase(main):004:0>
> >>
> >> Thanks again.
> >>
> >> -mike
> >>
> >> On Thu, Dec 3, 2009 at 3:21 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> >>
> >>> What's in the HDFS folder of that table? Here I see that you should
> >>> have something like:
> >>>
> >>> /hbase/cached_web_pages/1325672518/http/ stuff...
> >>>
> >>> Was there only this one region?
> >>>
> >>> Also are you able to enable a region in the shell? Take one of the row
> >>> keys from .META.
> >>> and do
> >>>
> >>> > enable_region 'region name'
> >>>
> >>> J-D
> >>>
> >>> On Thu, Dec 3, 2009 at 12:11 PM, mike anderson <saidthero...@gmail.com> wrote:
> >>> > Here's a snippet from the meta table (I can send you the whole thing,
> >>> > but it's quite large),
> >>> >
> >>> > cached_web_pages,http%3A%2F%2Fdx.doi.org%2F10.1002%252Fajpa.21214,1259739437144
> >>> >   column=info:serverstartcode, timestamp=1259853027975, value=1259852967063
> >>> >
> >>> > cached_web_pages,http%3A%2F%2Fdx.doi.org%2F10.1002%252Fejoc.200900768,1255504099435
> >>> >   column=historian:assignment, timestamp=1259807436758, value=Region assigned to server ghetto169.projectlounge.com,60020,1256139356112
> >>> >
> >>> > cached_web_pages,http%3A%2F%2Fdx.doi.org%2F10.1002%252Fejoc.200900768,1255504099435
> >>> >   column=historian:open, timestamp=1259807436723, value=Region opened on server : ghetto169.projectlounge.com
> >>> >
> >>> > cached_web_pages,http%3A%2F%2Fdx.doi.org%2F10.1002%252Fsmi.1285,1258589376676
> >>> >   column=historian:assignment, timestamp=1259853024917, value=Region assigned to server ghetto167.projectlounge.com,60020,1259852967063
> >>> >
> >>> > cached_web_pages,http%3A%2F%2Fdx.doi.org%2F10.1002%252Fsmi.1285,1258589376676
> >>> >   column=historian:open, timestamp=1259853027984, value=Region opened on server : ghetto167.projectlounge.com
> >>> >
> >>> > cached_web_pages,http%3A%2F%2Fdx.doi.org%2F10.1002%252Fsmi.1285,1258589376676
> >>> >   column=info:regioninfo, timestamp=1258589203875, value=REGION => {NAME => 'cached_web_pages,http\\x253A\\x252F\\x252Fdx.doi.org\\x252F10.1002\\x25252Fsmi.1285,1258589376676', STARTKEY => 'http\\x253A\\x252F\\x252Fdx.doi.org\\x252F10.1002\\x25252Fsmi.1285', ENDKEY =>
> >>> >   'http\\x253A\\x252F\\x252Fdx.doi.org\\x252F10.1016\\x252Fj.apergo.2009.09.005', ENCODED => 1325672518, TABLE => {{NAME => 'cached_web_pages', FAMILIES => [{NAME => 'http', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
> >>> >
> >>> >
> >>> > and you can see the table which has gone missing 'cached_web_pages' in
> >>> > the key spot. The crash over the weekend was pretty traumatic. Complete
> >>> > power outage to the entire cluster except(!) for the master. The data is
> >>> > definitely still on HDFS, I will take a look at the add_table script and
> >>> > upgrade to 0.20.2.
> >>> >
> >>> >
> >>> > Cheers and thanks a lot.
> >>> >
> >>> > mike
> >>> >
> >>> >
> >>> > On Thu, Dec 3, 2009 at 2:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> >>> >
> >>> >> This is weird if the table is in .META. and still not showing up...
> >>> >> could you pastebin the .META. rows?
> >>> >>
> >>> >> Also was it a new table that was just created or has it been there for
> >>> >> some time?
> >>> >>
> >>> >> What kind of crash did you get this weekend?
> >>> >>
> >>> >> The best way to recover your data, if it's still on HDFS, will be to
> >>> >> upgrade to 0.20.2 and use the script bin/add_table.rb to rebuild
> >>> >> .META.
> >>> >>
> >>> >> J-D
> >>> >>
> >>> >> On Thu, Dec 3, 2009 at 11:29 AM, mike anderson <saidthero...@gmail.com> wrote:
> >>> >> > From the web UI and from calling 'list' in the shell I can't see the
> >>> >> > table name.
> >>> >> >
> >>> >> > Hadoop/HBase 0.20/0.20.1, distributed setup, 10 nodes.
> >>> >> >
> >>> >> > -mike
> >>> >> >
> >>> >> > On Thu, Dec 3, 2009 at 1:54 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> >>> >> >
> >>> >> >> Mike,
> >>> >> >>
> >>> >> >> So if you looked in .META.
> >>> >> >> and the rows are there, how did you figure
> >>> >> >> that the table is missing?
> >>> >> >>
> >>> >> >> Also the usuals: which version of Hadoop/HBase, what kind of setup,
> >>> >> >> etc
> >>> >> >>
> >>> >> >> J-D
> >>> >> >>
> >>> >> >> On Thu, Dec 3, 2009 at 7:29 AM, mike anderson <saidthero...@gmail.com> wrote:
> >>> >> >> > HBase crashed on me this weekend, and upon restarting one of the
> >>> >> >> > tables is just completely gone. All of the table data is still in
> >>> >> >> > HDFS and my missing table is still mentioned in .META.. I tried
> >>> >> >> > restarting hbase a few times, but the table didn't show up. What
> >>> >> >> > else can I do to debug this? I looked through the logs, but nothing
> >>> >> >> > really jumped out at me. Is there something I should look for?
> >>> >> >> >
> >>> >> >> > I took a look at this ticket,
> >>> >> >> > http://issues.apache.org/jira/browse/HBASE-1342, but don't know
> >>> >> >> > enough about the inner workings of hbase to make sense of it.
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > thanks in advance.