Yes, upgrading to 0.20.3 should be added to my list above. I have
since done this.
Thanks very much for that.

On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> There were a lot of problems with Hadoop pre-0.20.2 on clusters
> smaller than 10 nodes, especially 3-node clusters with a node failure.
> If you are talking about just the region servers: you are on 0.20.2,
> and 0.20.3 has stability fixes.
>
> J-D
>
> On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
> <bmdevelopm...@gmail.com> wrote:
>> For completeness' sake, I'll update here.
>> The issues with the shell count and rowcounter crashing were fixed by upping
>> - open files to 32K (ulimit -n)
>> - dfs.datanode.max.xcievers to 2048
>> (I had overlooked this when moving to a larger cluster)
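A minimal way to check the first of those two limits from the shell; the file locations mentioned in the comments are assumptions for a typical Hadoop 0.20-era install, not something from this thread:

```shell
# Show the current per-process open-file limit for this user;
# the fix above raises it to 32K for the user running the daemons.
current=$(ulimit -n)
echo "open files limit: $current"

# Raising it persistently is typically done in /etc/security/limits.conf, e.g.:
#   hadoop  -  nofile  32768
# and dfs.datanode.max.xcievers goes in hdfs-site.xml on every datanode:
#   <property>
#     <name>dfs.datanode.max.xcievers</name>
#     <value>2048</value>
#   </property>
```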
>>
>> As for recovering from crashes, I haven't had much luck.
>> I'm only running a 3-server cluster, so that may be an issue,
>> but when one server goes down it doesn't seem to be easy
>> to recover the HBase table data after getting everything restarted.
>> I've usually had to wipe HDFS and start from scratch.
>>
>> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
>> <bmdevelopm...@gmail.com> wrote:
>>> Hi, Thanks for the suggestions. I'll make note of this.
>>> (I've decided to reinsert, as with time constraints it is probably
>>> quicker than trying to debug and recover.)
>>> So I guess I am more concerned with preventing this from
>>> happening again.
>>> Is it possible that a shell count caused enough load to crash HBase?
>>> Or could nodes becoming unavailable under heavy network load
>>> cause data corruption?
>>>
>>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>>> <michael_se...@hotmail.com> wrote:
>>>>
>>>> Try this...
>>>>
>>>> 1. Run hadoop fsck /
>>>> 2. Shut down HBase
>>>> 3. Move /hbase to /hbase.old
>>>> 4. Restart HBase (optional, just as a sanity check)
>>>> 5. Copy /hbase.old back to /hbase
>>>> 6. Restart HBase
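The steps above can be sketched as a sequence of commands. Script names and the `$HBASE_HOME` layout are assumptions for a standard 0.20-era install; run against a live cluster, not as-is:

```shell
hadoop fsck /                       # 1. check HDFS health first
$HBASE_HOME/bin/stop-hbase.sh       # 2. shut down HBase
hadoop fs -mv /hbase /hbase.old     # 3. move the HBase root dir aside
$HBASE_HOME/bin/start-hbase.sh      # 4. optional sanity-check restart
$HBASE_HOME/bin/stop-hbase.sh       #    (stop again before copying back)
hadoop fs -cp /hbase.old /hbase     # 5. copy the data back
$HBASE_HOME/bin/start-hbase.sh      # 6. restart
```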
>>>>
>>>> This may not help, but it can't hurt.
>>>> Depending on the size of your HBase database, it could take a while. On
>>>> our sandbox, we suffer from ZooKeeper and HBase failures when there's a
>>>> heavy load on the network. (Don't ask - the sandbox was just a play area
>>>> on whatever hardware we could find.) Doing this copy cleaned up a
>>>> database that wouldn't fully come up. It may do the same for you.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>
>>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>>> Subject: Re: hbase shell count crashes
>>>>> From: bmdevelopm...@gmail.com
>>>>> To: hbase-user@hadoop.apache.org
>>>>>
>>>>> Hi,
>>>>> So after a few more attempts and crashes from trying the shell count,
>>>>> I ran the MR rowcounter and noticed that the number of rows was lower
>>>>> than it should have been - even on smaller test tables.
>>>>> This led me to look through the logs and to perform a few
>>>>> compactions on .META. and restarts of HBase. Unfortunately, two tables
>>>>> are now entirely missing - they no longer show up under the shell's
>>>>> list command.
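For reference, the MR rowcounter mentioned above is typically invoked as below; the table name is a placeholder, and the exact class path is an assumption for HBase 0.20:

```shell
# Run the MapReduce-based row counter shipped with HBase,
# which scans the table as a job rather than through the shell:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable
```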
>>>>>
>>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>>> lot of this in the master log.
>>>>>
>>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>>> info:regioninfo is empty for row:
>>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>>> info:serverstartcode
>>>>>
>>>>> Came across this in the regionserver log:
>>>>> 2010-02-16 23:58:33,851 WARN
>>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>>> because its empty. HBASE-646 DATA LOSS?
>>>>>
>>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>>> re-insert from scratch, as this is still in the testing phase,
>>>>> but I would be curious to find out what led to these issues, in order
>>>>> to fix them or at least not repeat them.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>>> <bmdevelopm...@gmail.com> wrote:
>>>>> > Hi, Thanks for the explanation.
>>>>> >
>>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > 
>>>>> > tmp.out
>>>>> >
>>>>> > I have never come across this before, but this is the first time I've
>>>>> > had 7M rows in the db.
>>>>> > Is there anything going on that would bog down the network and cause
>>>>> > this file to be unreachable?
>>>>> >
>>>>> > I have 3 servers. The master is running the JobTracker, NameNode
>>>>> > and HMaster.
>>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>>> >
>>>>> > Appreciate the help.
>>>>> >
>>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans 
>>>>> > <jdcry...@apache.org> wrote:
>>>>> >> This line:
>>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>>> >> blk_-6288142015045035704_88516
>>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>>> >>
>>>>> >> means that the region server wasn't able to fetch a block for the
>>>>> >> .META. table (the table where all region addresses are stored). Are
>>>>> >> you able to open that file using the bin/hadoop command-line utility?
>>>>> >>
>>>>> >> J-D
>>>>> >>
>>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>>> >> bmdevelopm...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hi,
>>>>> >>> I'm currently trying to run a count in the HBase shell and it crashes
>>>>> >>> right towards the end.
>>>>> >>> This in turn seems to crash HBase, or at least causes the region
>>>>> >>> servers to become unavailable.
>>>>> >>>
>>>>> >>> Here's the tail end of the count output:
>>>>> >>> http://pastebin.com/m465346d0
>>>>> >>>
>>>>> >>> I'm on version 0.20.2 and running this command:
>>>>> >>> > count 'table', 1000000
>>>>> >>>
>>>>> >>> Anyone with similar issues or ideas on this?
>>>>> >>> Please let me know if you need further info.
>>>>> >>> Thanks
>>>>> >>>
>>>>> >>
>>>>> >
>>>>
>>>
>>
>
