Thanks. I'll take a look at that in depth as soon as I have a chance. Seriously though, brilliant work and thanks to all involved - it's progressed a great deal even in the last 9 months I've been following and using the product. Really enjoying it.
On Wed, Mar 3, 2010 at 5:58 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Mmm, then you might be hitting http://issues.apache.org/jira/browse/HBASE-2244
>
> As you can see, we are working hard to stabilize HBase as much as possible ;)
>
> J-D
>
> On Wed, Mar 3, 2010 at 2:56 PM, Bluemetrix Development
> <bmdevelopm...@gmail.com> wrote:
>> Yes, upgrading to 0.20.3 should be added to my list above. I have
>> since done this.
>> Thanks very much for that.
>>
>> On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>>> There were a lot of problems with Hadoop pre-0.20.2 for clusters
>>> smaller than 10 nodes, especially 3-node clusters when a node fails.
>>> If you are talking about just region servers: you are using 0.20.2,
>>> and 0.20.3 has stability fixes.
>>>
>>> J-D
>>>
>>> On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
>>> <bmdevelopm...@gmail.com> wrote:
>>>> For completeness' sake, I'll update here.
>>>> The issues with shell counts and rowcounter crashing were fixed by upping
>>>> - open files to 32K (ulimit -n)
>>>> - dfs.datanode.max.xcievers to 2048
>>>> (I had overlooked this when moving to a larger cluster.)
>>>>
>>>> As for recovering from crashes, I haven't had much luck.
>>>> I'm only running a 3-server cluster, so that may be an issue,
>>>> but when one server goes down, it doesn't seem to be too easy
>>>> to recover the HBase table data after getting everything restarted again.
>>>> I've usually had to wipe HDFS and start from scratch.
>>>>
>>>> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
>>>> <bmdevelopm...@gmail.com> wrote:
>>>>> Hi, thanks for the suggestions. I'll make note of this.
>>>>> (I've decided to reinsert, as with time constraints it is probably
>>>>> quicker than trying to debug and recover.)
>>>>> So I guess I am more concerned with preventing this from
>>>>> happening again.
>>>>> Is it possible that a shell count caused enough load to crash HBase?
>>>>> Or that nodes becoming unavailable due to heavy network load could
>>>>> cause data corruption?
>>>>>
>>>>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>>>>> <michael_se...@hotmail.com> wrote:
>>>>>>
>>>>>> Try this...
>>>>>>
>>>>>> 1. Run hadoop fsck /
>>>>>> 2. Shut down HBase
>>>>>> 3. mv /hbase to /hbase.old
>>>>>> 4. Restart HBase (optional, just as a sanity check)
>>>>>> 5. Copy /hbase.old back to /hbase
>>>>>> 6. Restart
>>>>>>
>>>>>> This may not help, but it can't hurt.
>>>>>> Depending on the size of your HBase database, it could take a while. On
>>>>>> our sandbox, we suffer from ZooKeeper and HBase failures when there's a
>>>>>> heavy load on the network. (Don't ask, the sandbox was just a play area
>>>>>> on whatever hardware we could find.) Doing this copy cleaned up a
>>>>>> database that wouldn't fully come up. It may do the same for you.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>
>>>>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>>>>> Subject: Re: hbase shell count crashes
>>>>>>> From: bmdevelopm...@gmail.com
>>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>>
>>>>>>> Hi,
>>>>>>> So after a few more attempts and crashes from trying the shell count,
>>>>>>> I ran the MR rowcounter and noticed that the number of rows was less
>>>>>>> than it should have been - even on smaller test tables.
>>>>>>> This led me to start looking through the logs and perform a few
>>>>>>> compactions on .META. and restarts of HBase. Unfortunately, two tables
>>>>>>> are now entirely missing - they no longer show up under the shell's
>>>>>>> list command.
>>>>>>>
>>>>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>>>>> lot of this in the master log:
>>>>>>>
>>>>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>>>>> info:regioninfo is empty for row:
>>>>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>>>>> info:serverstartcode
>>>>>>>
>>>>>>> I came across this in the regionserver log:
>>>>>>>
>>>>>>> 2010-02-16 23:58:33,851 WARN
>>>>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>>>>> because its empty. HBASE-646 DATA LOSS?
>>>>>>>
>>>>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>>>>> re-insert from scratch, as this is still in the testing phase,
>>>>>>> but I would be curious to find out what led to these issues in order
>>>>>>> to fix them, or at least not repeat them.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>>>>> <bmdevelopm...@gmail.com> wrote:
>>>>>>> > Hi, thanks for the explanation.
>>>>>>> >
>>>>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 >
>>>>>>> > tmp.out
>>>>>>> >
>>>>>>> > I have never come across this before, but this is the first time I've
>>>>>>> > had 7M rows in the db.
>>>>>>> > Is there anything going on that would bog down the network and cause
>>>>>>> > this file to be unreachable?
>>>>>>> >
>>>>>>> > I have 3 servers. The master is running the jobtracker, namenode and
>>>>>>> > hmaster.
>>>>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>>>>> >
>>>>>>> > Appreciate the help.
>>>>>>> >
>>>>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans
>>>>>>> > <jdcry...@apache.org> wrote:
>>>>>>> >> This line:
>>>>>>> >>
>>>>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>>>>> >> blk_-6288142015045035704_88516
>>>>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>>>>> >>
>>>>>>> >> means that the region server wasn't able to fetch a block for the
>>>>>>> >> .META. table (the table where all region addresses are). Are you
>>>>>>> >> able to open that file using the bin/hadoop command line utility?
>>>>>>> >>
>>>>>>> >> J-D
>>>>>>> >>
>>>>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>>>>> >> bmdevelopm...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >>> Hi,
>>>>>>> >>> I'm currently trying to run a count in the hbase shell and it
>>>>>>> >>> crashes right towards the end.
>>>>>>> >>> This in turn seems to crash HBase, or at least causes the
>>>>>>> >>> regionservers to become unavailable.
>>>>>>> >>>
>>>>>>> >>> Here's the tail end of the count output:
>>>>>>> >>> http://pastebin.com/m465346d0
>>>>>>> >>>
>>>>>>> >>> I'm on version 0.20.2 and running this command:
>>>>>>> >>> > count 'table', 1000000
>>>>>>> >>>
>>>>>>> >>> Anyone with similar issues or ideas on this?
>>>>>>> >>> Please let me know if you need further info.
>>>>>>> >>> Thanks
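
For reference, the two settings the thread converged on (32K open file descriptors and dfs.datanode.max.xcievers at 2048) look roughly like this on a Hadoop/HBase 0.20 cluster. This is only a sketch: the "hadoop" user name and file locations are assumptions about a typical Linux install, so adjust them for your own nodes.

  # /etc/security/limits.conf on every node, assuming the Hadoop/HBase
  # daemons run as the "hadoop" user; log out and back in, then verify
  # with "ulimit -n"
  hadoop  -  nofile  32768

  # hdfs-site.xml on every datanode (note the property's historical
  # misspelling "xcievers"); restart the datanodes afterwards
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>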
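
Michael's copy-aside recovery from upthread, spelled out as commands. This is a sketch that assumes hbase.rootdir points at /hbase in HDFS and that the stock bin/stop-hbase.sh and bin/start-hbase.sh scripts are used; check hbase-site.xml before moving anything, and expect the copy to take a while on a large table.

  hadoop fsck /                     # 1. make sure HDFS itself is healthy
  bin/stop-hbase.sh                 # 2. shut down HBase
  hadoop fs -mv /hbase /hbase.old   # 3. set the old root directory aside
  bin/start-hbase.sh                # 4. optional sanity check with an empty rootdir
  bin/stop-hbase.sh                 #    (stop again before restoring)
  hadoop fs -cp /hbase.old /hbase   # 5. copy the data back
  bin/start-hbase.sh                # 6. restart and let the master rescan the regions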
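
And for cross-checking row counts without going through the shell, the MapReduce RowCounter mentioned above can be run from the HBase jar. The jar name below assumes 0.20.2 and <tablename> is a placeholder; the HBase config and jars also need to be visible to the MapReduce job (e.g. via HADOOP_CLASSPATH).

  hadoop jar $HBASE_HOME/hbase-0.20.2.jar rowcounter <tablename>
  # or, equivalently, through the hbase launcher script:
  bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>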