There were a lot of problems with Hadoop pre-0.20.2 on clusters smaller than 10 nodes, especially 3-node ones, when a node fails. If you are talking about just the region servers: you are using 0.20.2, and 0.20.3 has stability fixes.
J-D

On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
<bmdevelopm...@gmail.com> wrote:
> For completeness' sake, I'll update here.
> The issue with shell counts and rowcounter crashing was fixed by upping
> - open files to 32K (ulimit -n)
> - dfs.datanode.max.xcievers to 2048
> (I had overlooked this when moving to a larger cluster)
>
> As for recovering from crashes, I haven't had much luck.
> I'm only running a 3-server cluster so that may be an issue,
> but when one server goes down, it doesn't seem to be too easy
> to recover the HBase table data after getting everything restarted again.
> I've usually had to wipe HDFS and start from scratch.
>
> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
> <bmdevelopm...@gmail.com> wrote:
>> Hi, Thanks for the suggestions. I'll make note of this.
>> (I've decided to reinsert, as with time constraints it is probably
>> quicker than trying to debug and recover.)
>> So, I guess I am more concerned about trying to prevent this from
>> happening again.
>> Is it possible that a shell count caused enough load to crash HBase?
>> Or that nodes becoming unavailable due to heavy network load could
>> cause data corruption?
>>
>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>> <michael_se...@hotmail.com> wrote:
>>>
>>> Try this...
>>>
>>> 1 run hadoop fsck /
>>> 2 shut down hbase
>>> 3 mv /hbase to /hbase.old
>>> 4 restart hbase (optional, just for a sanity check)
>>> 5 copy /hbase.old back to /hbase
>>> 6 restart
>>>
>>> This may not help, but it can't hurt.
>>> Depending on the size of your hbase database, it could take a while.
>>> On our sandbox, we suffer from zookeeper and hbase failures when
>>> there's a heavy load on the network. (Don't ask, the sandbox was just
>>> a play area on whatever hardware we could find.) Doing this copy
>>> cleaned up a database that wouldn't fully come up. May do the same
>>> for you.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>> Subject: Re: hbase shell count crashes
>>>> From: bmdevelopm...@gmail.com
>>>> To: hbase-user@hadoop.apache.org
>>>>
>>>> Hi,
>>>> So after a few more attempts and crashes from trying the shell count,
>>>> I ran the MR rowcounter and noticed that the number of rows was less
>>>> than it should have been - even on smaller test tables.
>>>> This led me to start looking through the logs and perform a few
>>>> compacts on META and restarts of HBase. Unfortunately, now two tables
>>>> are entirely missing - they no longer show up under the shell list command.
>>>>
>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>> lot of this in the master log:
>>>>
>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>> info:regioninfo is empty for row:
>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>> info:serverstartcode
>>>>
>>>> Came across this in the regionserver log:
>>>>
>>>> 2010-02-16 23:58:33,851 WARN
>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>> because its empty. HBASE-646 DATA LOSS?
>>>>
>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>> re-insert from scratch as this is still in the testing phase,
>>>> but I would be curious to find out what has led to these issues in order
>>>> to possibly fix them, or at least not repeat them.
>>>>
>>>> Thanks
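For anyone hitting the same limits: the two changes mentioned at the top of this thread (the open-file ulimit and dfs.datanode.max.xcievers) look roughly like the snippet below. The file locations and the "hadoop" user name are assumptions on my part, so adjust them for your own install, and restart the datanode/regionserver daemons afterwards for the new limits to take effect.

    # /etc/security/limits.conf -- raise the open-file limit for the user
    # that runs the datanode and regionserver processes (user name assumed)
    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768

    # conf/hdfs-site.xml on every datanode -- raise the xcievers ceiling
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2048</value>
    </property>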
>>>>
>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>> <bmdevelopm...@gmail.com> wrote:
>>>> > Hi, Thanks for the explanation.
>>>> >
>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 >
>>>> > tmp.out
>>>> >
>>>> > I have never come across this before, but this is the first time I've
>>>> > had 7M rows in the db.
>>>> > Is there anything going on that would bog down the network and cause
>>>> > this file to be unreachable?
>>>> >
>>>> > I have 3 servers. The master is running the jobtracker, namenode and
>>>> > hmaster.
>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>> >
>>>> > Appreciate the help.
>>>> >
>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans
>>>> > <jdcry...@apache.org> wrote:
>>>> >> This line
>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>> >> blk_-6288142015045035704_88516
>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>> >>
>>>> >> means that the region server wasn't able to fetch a block for the .META.
>>>> >> table (the table where all region addresses are). Are you able to open
>>>> >> that file using the bin/hadoop command line utility?
>>>> >>
>>>> >> J-D
>>>> >>
>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>> >> bmdevelopm...@gmail.com> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>> I'm currently trying to run a count in the hbase shell and it crashes
>>>> >>> right towards the end.
>>>> >>> This in turn seems to crash hbase, or at least causes the regionservers
>>>> >>> to become unavailable.
>>>> >>>
>>>> >>> Here's the tail end of the count output:
>>>> >>> http://pastebin.com/m465346d0
>>>> >>>
>>>> >>> I'm on version 0.20.2 and running this command:
>>>> >>> > count 'table', 1000000
>>>> >>>
>>>> >>> Anyone with similar issues or ideas on this?
>>>> >>> Please let me know if you need further info.
>>>> >>> Thanks
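For anyone landing on this thread later: Michael's copy-out/copy-back steps quoted above translate to roughly the commands below. This is only a sketch -- it assumes the default /hbase root directory, that the Hadoop and HBase bin/ directories are on the PATH, and that you stop HBase again and remove the freshly created /hbase after the optional sanity-check restart (my reading of step 4) before copying the old data back.

    hadoop fsck /                        # 1. check HDFS health first
    stop-hbase.sh                        # 2. shut down HBase
    hadoop fs -mv /hbase /hbase.old      # 3. move the data aside
    start-hbase.sh                       # 4. optional restart, just as a sanity check
    stop-hbase.sh                        #    stop again and remove the freshly
    hadoop fs -rmr /hbase                #    created /hbase before copying back
    hadoop fs -cp /hbase.old /hbase      # 5. copy /hbase.old back to /hbase
    start-hbase.sh                       # 6. restart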