Hi,
Thanks for the suggestions. I'll make a note of this. (I've decided to
re-insert, as with time constraints it is probably quicker than trying to
debug and recover.)

So I guess I am more concerned about preventing this from happening again.
Is it possible that a shell count generated enough load to crash HBase? Or
that nodes becoming unavailable under heavy network load could corrupt data?
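To make sure I've noted your steps correctly, here they are written out as
the commands I plan to keep on hand. (A sketch only: this assumes the stock
bin/ scripts and that /hbase is the HBase root directory in HDFS; fs -cp
recurses on directories, though distcp may be kinder for a large tree.)

  # 1. check HDFS health before touching anything
  bin/hadoop fsck /

  # 2. shut down HBase cleanly
  bin/stop-hbase.sh

  # 3. move the HBase root directory aside in HDFS
  bin/hadoop fs -mv /hbase /hbase.old

  # 4. optional sanity check: start (and stop) HBase against the empty root
  bin/start-hbase.sh
  bin/stop-hbase.sh

  # 5. copy the old data back; this is the step that can take a while
  #    (if the sanity check recreated /hbase, remove it first with
  #    bin/hadoop fs -rmr /hbase)
  bin/hadoop fs -cp /hbase.old /hbase

  # 6. restart HBase
  bin/start-hbase.sh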
On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>
> Try this...
>
> 1 run hadoop fsck /
> 2 shut down hbase
> 3 mv /hbase to /hbase.old
> 4 restart hbase (optional, just for a sanity check)
> 5 copy /hbase.old back to /hbase
> 6 restart
>
> This may not help, but it can't hurt.
> Depending on the size of your hbase database, it could take a while. On our
> sandbox, we suffer from zookeeper and hbase failures when there's a heavy
> load on the network. (Don't ask, the sandbox was just a play area on whatever
> hardware we could find.) Doing this copy cleaned up a database that wouldn't
> fully come up. May do the same for you.
>
> HTH
>
> -Mike
>
>
>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>> Subject: Re: hbase shell count crashes
>> From: bmdevelopm...@gmail.com
>> To: hbase-user@hadoop.apache.org
>>
>> Hi,
>> So after a few more attempts and crashes from trying the shell count,
>> I ran the MR rowcounter and noticed that the number of rows was less
>> than it should have been, even on smaller test tables.
>> This led me to start looking through the logs and perform a few
>> compactions of .META. and restarts of HBase. Unfortunately, two tables
>> are now entirely missing and no longer show up under the shell's list
>> command.
>>
>> I'm not entirely sure what to look for in the logs, but I've noticed a
>> lot of this in the master log:
>>
>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>> info:regioninfo is empty for row:
>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>> info:serverstartcode
>>
>> I also came across this in the regionserver log:
>>
>> 2010-02-16 23:58:33,851 WARN
>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>> because its empty. HBASE-646 DATA LOSS?
>>
>> Any ideas if the tables are recoverable? It's not a big deal for me to
>> re-insert from scratch, as this is still in the testing phase,
>> but I would be curious to find out what led to these issues in order
>> to fix them, or at least not repeat them.
>>
>> Thanks
>>
>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>> <bmdevelopm...@gmail.com> wrote:
>> > Hi, thanks for the explanation.
>> >
>> > Yes, I was able to cat the file from all three of my region servers:
>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>> >
>> > I have never come across this before, but this is the first time I've
>> > had 7M rows in the db.
>> > Is there anything going on that would bog down the network and cause
>> > this file to be unreachable?
>> >
>> > I have 3 servers. The master is running the jobtracker, namenode and
>> > hmaster,
>> > and all 3 are running datanodes, regionservers and zookeeper.
>> >
>> > Appreciate the help.
>> >
>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>> > wrote:
>> >> This line:
>> >>
>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>> >> blk_-6288142015045035704_88516
>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>> >>
>> >> means that the region server wasn't able to fetch a block for the .META.
>> >> table (the table where all region addresses are). Are you able to open
>> >> that file using the bin/hadoop command line utility?
>> >>
>> >> J-D
>> >>
>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>> >> bmdevelopm...@gmail.com> wrote:
>> >>
>> >>> Hi,
>> >>> I'm currently trying to run a count in the hbase shell and it crashes
>> >>> right towards the end.
>> >>> This in turn seems to crash HBase, or at least causes the regionservers
>> >>> to become unavailable.
>> >>>
>> >>> Here's the tail end of the count output:
>> >>> http://pastebin.com/m465346d0
>> >>>
>> >>> I'm on version 0.20.2 and running this command:
>> >>> > count 'table', 1000000
>> >>>
>> >>> Anyone with similar issues or ideas on this?
>> >>> Please let me know if you need further info.
>> >>> Thanks
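P.S. For the archives, here are the two counts referenced above, roughly as
I've been running them. The shell count scans the entire table through a
single client, while rowcounter distributes the scan as a MapReduce job.
(A sketch only: the exact rowcounter arguments vary between builds, and some
revisions also want an output directory first, so check the usage string the
jar prints. UserData_0209 is the table from the master log above.)

  # client-side count from the hbase shell; the second argument is just
  # the progress-reporting interval
  hbase(main):001:0> count 'UserData_0209', 1000000

  # MapReduce-based count, run from $HBASE_HOME against the 0.20.2 jar
  bin/hadoop jar hbase-0.20.2.jar rowcounter UserData_0209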