Mmm, then you might be hitting http://issues.apache.org/jira/browse/HBASE-2244
As you can see, we are working hard to stabilize HBase as much as possible ;)

J-D

On Wed, Mar 3, 2010 at 2:56 PM, Bluemetrix Development <bmdevelopm...@gmail.com> wrote:
> Yes, upgrading to 0.20.3 should be added to my list above. I have
> since done this.
> Thanks very much for that.
>
> On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> There were a lot of problems with Hadoop pre-0.20.2 for clusters
>> smaller than 10 nodes, especially 3, when handling node failure. If you are
>> talking about just region servers: you are using 0.20.2, and 0.20.3 has
>> stability fixes.
>>
>> J-D
>>
>> On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
>> <bmdevelopm...@gmail.com> wrote:
>>> For completeness' sake, I'll update here.
>>> The issues with shell counts and rowcounter crashing were fixed by upping
>>> - open files to 32K (ulimit -n)
>>> - dfs.datanode.max.xcievers to 2048
>>> (I had overlooked this when moving to a larger cluster.)
>>>
>>> As for recovering from crashes, I haven't had much luck.
>>> I'm only running a 3-server cluster, so that may be an issue,
>>> but when one server goes down, it doesn't seem to be too easy
>>> to recover the HBase table data after getting everything restarted again.
>>> I've usually had to wipe HDFS and start from scratch.
>>>
>>> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
>>> <bmdevelopm...@gmail.com> wrote:
>>>> Hi, thanks for the suggestions. I'll make note of this.
>>>> (I've decided to reinsert, as with time constraints it is probably
>>>> quicker than trying to debug and recover.)
>>>> So, I guess I am more concerned about trying to prevent this from
>>>> happening again.
>>>> Is it possible that a shell count caused enough load to crash HBase?
>>>> Or that nodes becoming unavailable due to heavy network load could
>>>> cause data corruption?
>>>>
>>>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>>>> <michael_se...@hotmail.com> wrote:
>>>>>
>>>>> Try this...
>>>>>
>>>>> 1. Run hadoop fsck /
>>>>> 2. Shut down HBase
>>>>> 3. mv /hbase to /hbase.old
>>>>> 4. Restart HBase (optional, just for a sanity check)
>>>>> 5. Copy /hbase.old back to /hbase
>>>>> 6. Restart
>>>>>
>>>>> This may not help, but it can't hurt.
>>>>> Depending on the size of your HBase database, it could take a while. On
>>>>> our sandbox, we suffer from ZooKeeper and HBase failures when there's a
>>>>> heavy load on the network. (Don't ask, the sandbox was just a play area
>>>>> on whatever hardware we could find.) Doing this copy cleaned up a
>>>>> database that wouldn't fully come up. May do the same for you.
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>>
>>>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>>>> Subject: Re: hbase shell count crashes
>>>>>> From: bmdevelopm...@gmail.com
>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>
>>>>>> Hi,
>>>>>> So after a few more attempts and crashes from trying the shell count,
>>>>>> I ran the MR rowcounter and noticed that the number of rows was less
>>>>>> than it should have been - even on smaller test tables.
>>>>>> This led me to start looking through the logs and perform a few
>>>>>> compacts on .META. and restarts of HBase. Unfortunately, now two tables
>>>>>> are entirely missing - they no longer show up under the shell list command.
>>>>>>
>>>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>>>> lot of this in the master log:
>>>>>>
>>>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>>>> info:regioninfo is empty for row:
>>>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>>>> info:serverstartcode
>>>>>>
>>>>>> And I came across this in the regionserver log:
>>>>>>
>>>>>> 2010-02-16 23:58:33,851 WARN
>>>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>>>> because its empty. HBASE-646 DATA LOSS?
>>>>>>
>>>>>> Any ideas if the tables are recoverable?
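[Editor's note: Mike's six recovery steps above can be sketched as shell commands. This is a sketch only - the script locations, `$HBASE_HOME`, and the default `/hbase` root dir are assumptions for a typical 0.20.x install; verify against your own cluster before running, as it reads and rewrites the entire HBase root directory.]

```shell
# Sketch of the six recovery steps above (paths and script names assumed).
hadoop fsck /                       # 1. check HDFS health before touching anything
$HBASE_HOME/bin/stop-hbase.sh       # 2. shut down HBase
hadoop fs -mv /hbase /hbase.old     # 3. move the HBase root dir aside
$HBASE_HOME/bin/start-hbase.sh      # 4. optional sanity check against an empty root
$HBASE_HOME/bin/stop-hbase.sh
hadoop fs -cp /hbase.old /hbase     # 5. copy the data back (slow for big tables)
$HBASE_HOME/bin/start-hbase.sh      # 6. restart
```

Keep /hbase.old around until you have verified the tables came back; the copy, not the move, is what forces the data to be re-read cleanly.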
>>>>>> It's not a big deal for me to
>>>>>> re-insert from scratch, as this is still in the testing phase,
>>>>>> but I would be curious to find out what has led to these issues in order
>>>>>> to possibly fix them, or at least not repeat them.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>>>> <bmdevelopm...@gmail.com> wrote:
>>>>>> > Hi, thanks for the explanation.
>>>>>> >
>>>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>>>>> >
>>>>>> > I have never come across this before, but this is the first time I've
>>>>>> > had 7M rows in the db.
>>>>>> > Is there anything going on that would bog down the network and cause
>>>>>> > this file to be unreachable?
>>>>>> >
>>>>>> > I have 3 servers. The master is running the jobtracker, namenode and
>>>>>> > hmaster,
>>>>>> > and all 3 are running datanodes, regionservers and zookeeper.
>>>>>> >
>>>>>> > Appreciate the help.
>>>>>> >
>>>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans
>>>>>> > <jdcry...@apache.org> wrote:
>>>>>> >> This line:
>>>>>> >>
>>>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>>>> >> blk_-6288142015045035704_88516
>>>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>>>> >>
>>>>>> >> means that the region server wasn't able to fetch a block for the .META.
>>>>>> >> table (the table where all region addresses are). Are you able to
>>>>>> >> open that
>>>>>> >> file using the bin/hadoop command line utility?
>>>>>> >>
>>>>>> >> J-D
>>>>>> >>
>>>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>>>> >> bmdevelopm...@gmail.com> wrote:
>>>>>> >>
>>>>>> >>> Hi,
>>>>>> >>> I'm currently trying to run a count in the hbase shell and it crashes
>>>>>> >>> right towards the end.
>>>>>> >>> This in turn seems to crash HBase, or at least causes the regionservers
>>>>>> >>> to become unavailable.
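[Editor's note: the two counting approaches discussed in this thread - the shell `count` and the MR rowcounter - can be invoked roughly as below. The jar path, output directory, table name, and `info:` column family are assumptions for a 0.20.x-era install, not taken from the thread; check `hadoop jar hbase-*.jar rowcounter` with no arguments for the exact usage on your version.]

```shell
# Two ways to count rows (0.20.x-era invocations; paths and names assumed).
# Shell count: a single-client scan; the trailing number is only the
# progress-report interval, and the scan load lands on one client.
echo "count 'UserData_0209', 1000000" | $HBASE_HOME/bin/hbase shell

# MapReduce RowCounter: distributes the scan across the cluster, which is
# why it can succeed (or at least fail more informatively) where the shell
# count bogs down.
hadoop jar $HBASE_HOME/hbase-0.20.2.jar rowcounter /tmp/rc-out UserData_0209 info:
```

If the two counts disagree, as reported above, that points at missing or unreachable regions rather than at the counting tool itself.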
>>>>>> >>>
>>>>>> >>> Here's the tail end of the count output:
>>>>>> >>> http://pastebin.com/m465346d0
>>>>>> >>>
>>>>>> >>> I'm on version 0.20.2 and running this command:
>>>>>> >>> > count 'table', 1000000
>>>>>> >>>
>>>>>> >>> Anyone with similar issues or ideas on this?
>>>>>> >>> Please let me know if you need further info.
>>>>>> >>> Thanks
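[Editor's note: for reference, the two limits that resolved the crashes earlier in the thread - the open-file limit and the datanode transceiver cap - are applied as follows. The values (32K and 2048) come from the thread; where to set them is the standard location for a 0.20.x install and is an assumption here.]

```shell
# The two limits raised upthread.
# 1. Open-file limit for the user running the Hadoop/HBase daemons
#    (thread used 32K; make it permanent in /etc/security/limits.conf):
#      ulimit -n 32768
# 2. Datanode transceiver cap, set in conf/hdfs-site.xml on every datanode.
#    Note the property name's historical misspelling, "xcievers":
cat <<'EOF'
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
EOF
```

Both limits have to be raised together: each open region file consumes a file descriptor on the regionserver and a transceiver thread on the datanode, so a table with many store files exhausts whichever cap is lower first.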