Yes, upgrading to 0.20.3 should be added to my list above. I have since done this. Thanks very much for that.
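For anyone else who hits the same xceiver / "too many open files" errors, this is roughly
what the two fixes from my list look like in practice. Only a sketch: the limits file,
config path and user name below are assumptions based on a stock Linux box with a default
Hadoop 0.20 layout, so adjust them to your own install.

  # Check and raise the open-file limit in the shell that launches the daemons
  # (32K is the value that worked for me):
  ulimit -n                 # current limit
  ulimit -n 32768           # raise it for this shell

  # To make it permanent, add lines like these to /etc/security/limits.conf,
  # assuming the daemons run as a "hadoop" user:
  #   hadoop  soft  nofile  32768
  #   hadoop  hard  nofile  32768

  # Then raise the datanode xceiver limit in conf/hdfs-site.xml on every datanode
  # and restart the datanodes so the new value is picked up:
  #   <property>
  #     <name>dfs.datanode.max.xcievers</name>
  #     <value>2048</value>
  #   </property>
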
On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> There were a lot of problems with Hadoop pre-0.20.2 for clusters
> smaller than 10, especially clusters of 3, when a node fails. If you are
> talking about just the region servers: you are using 0.20.2, and
> 0.20.3 has stability fixes.
>
> J-D
>
> On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
> <bmdevelopm...@gmail.com> wrote:
>> For completeness' sake, I'll update here.
>> The issues with shell counts and rowcounter crashing were fixed by upping
>> - open files to 32K (ulimit -n)
>> - dfs.datanode.max.xcievers to 2048
>> (I had overlooked this when moving to a larger cluster)
>>
>> As for recovering from crashes, I haven't had much luck.
>> I'm only running a 3-server cluster, so that may be an issue,
>> but when one server goes down, it doesn't seem to be too easy
>> to recover the HBase table data after getting everything restarted again.
>> I've usually had to wipe HDFS and start from scratch.
>>
>> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
>> <bmdevelopm...@gmail.com> wrote:
>>> Hi, thanks for the suggestions. I'll make note of this.
>>> (I've decided to reinsert, as with time constraints it is probably
>>> quicker than trying to debug and recover.)
>>> So, I guess I am more concerned about trying to prevent this from
>>> happening again.
>>> Is it possible that a shell count caused enough load to crash HBase?
>>> Or that nodes becoming unavailable under heavy network load could
>>> cause data corruption?
>>>
>>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>>> <michael_se...@hotmail.com> wrote:
>>>>
>>>> Try this...
>>>>
>>>> 1 run hadoop fsck /
>>>> 2 shut down hbase
>>>> 3 mv /hbase to /hbase.old
>>>> 4 restart hbase (optional, just for a sanity check)
>>>> 5 copy /hbase.old back to /hbase
>>>> 6 restart
>>>>
>>>> This may not help, but it can't hurt.
>>>> Depending on the size of your hbase database, it could take a while. On
>>>> our sandbox, we suffer from zookeeper and hbase failures when there's a
>>>> heavy load on the network. (Don't ask, the sandbox was just a play area on
>>>> whatever hardware we could find.) Doing this copy cleaned up a database
>>>> that wouldn't fully come up. May do the same for you.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>
>>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>>> Subject: Re: hbase shell count crashes
>>>>> From: bmdevelopm...@gmail.com
>>>>> To: hbase-user@hadoop.apache.org
>>>>>
>>>>> Hi,
>>>>> So after a few more attempts and crashes from trying the shell count,
>>>>> I ran the MR rowcounter and noticed that the number of rows was less
>>>>> than it should have been - even on smaller test tables.
>>>>> This led me to start looking through the logs and perform a few
>>>>> compacts on META and restarts of HBase. Unfortunately, two tables are
>>>>> now entirely missing - they no longer show up under the shell's list command.
>>>>>
>>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>>> lot of this in the master log:
>>>>>
>>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>>> info:regioninfo is empty for row:
>>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>>> info:serverstartcode
>>>>>
>>>>> Came across this in the regionserver log:
>>>>> 2010-02-16 23:58:33,851 WARN
>>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>>> because its empty. HBASE-646 DATA LOSS?
>>>>>
>>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>>> re-insert from scratch as this is still in the testing phase,
>>>>> but I would be curious to find out what has led to these issues, in order
>>>>> to possibly fix them or at least not repeat them.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>>> <bmdevelopm...@gmail.com> wrote:
>>>>> > Hi, thanks for the explanation.
>>>>> >
>>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>>>> >
>>>>> > I have never come across this before, but this is the first time I've
>>>>> > had 7M rows in the db.
>>>>> > Is there anything going on that would bog down the network and cause
>>>>> > this file to be unreachable?
>>>>> >
>>>>> > I have 3 servers. The master is running the jobtracker, namenode and
>>>>> > hmaster.
>>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>>> >
>>>>> > Appreciate the help.
>>>>> >
>>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans
>>>>> > <jdcry...@apache.org> wrote:
>>>>> >> This line
>>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>>> >> blk_-6288142015045035704_88516
>>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>>> >>
>>>>> >> means that the region server wasn't able to fetch a block for the .META.
>>>>> >> table (the table where all region addresses are). Are you able to open
>>>>> >> that file using the bin/hadoop command line utility?
>>>>> >>
>>>>> >> J-D
>>>>> >>
>>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development
>>>>> >> <bmdevelopm...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hi,
>>>>> >>> I'm currently trying to run a count in the hbase shell and it crashes
>>>>> >>> right towards the end.
>>>>> >>> This in turn seems to crash hbase, or at least causes the regionservers
>>>>> >>> to become unavailable.
>>>>> >>>
>>>>> >>> Here's the tail end of the count output:
>>>>> >>> http://pastebin.com/m465346d0
>>>>> >>>
>>>>> >>> I'm on version 0.20.2 and running this command:
>>>>> >>> > count 'table', 1000000
>>>>> >>>
>>>>> >>> Anyone with similar issues or ideas on this?
>>>>> >>> Please let me know if you need further info.
>>>>> >>> Thanks
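
PS - for the archives, here is roughly how Michael's copy-out/copy-back steps above map to
commands. Again just a sketch: it assumes the hadoop and hbase bin/ scripts are on the PATH
of the user that owns /hbase, and that HDFS has enough free space for a second copy of the
table data. The "rmr" step is my own addition, not part of Michael's list.

  hadoop fsck /                      # 1. check overall HDFS health first
  stop-hbase.sh                      # 2. shut HBase down cleanly
  hadoop fs -mv /hbase /hbase.old    # 3. move the HBase root directory aside
  start-hbase.sh                     # 4. optional sanity check against an empty root...
  stop-hbase.sh                      #    ...then stop again
  hadoop fs -rmr /hbase              #    and remove the empty root the check created
  hadoop fs -cp /hbase.old /hbase    # 5. copy the old data back (can take a while)
  start-hbase.sh                     # 6. restart and see whether the regions come back up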