There were a lot of problems with Hadoop pre-0.20.2 on clusters smaller
than 10 nodes, especially 3-node clusters, when a node fails. If you are
talking about just the region servers: you are on 0.20.2, and 0.20.3 has
stability fixes.

J-D

On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
<bmdevelopm...@gmail.com> wrote:
> For completeness' sake, I'll update here.
> The issues with the shell count and rowcounter crashing were fixed by upping
> - open files to 32K (ulimit -n)
> - dfs.datanode.max.xcievers to 2048
> (I had overlooked this when moving to a larger cluster)
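>
> For anyone hitting the same thing, this is roughly what those two changes
> look like (just a sketch - the "hadoop" user name and conf paths are
> assumptions about my setup, adjust for yours):
>
>   # /etc/security/limits.conf on every node running a datanode/regionserver
>   # (re-login and restart the daemons afterwards so the new limit is picked up)
>   hadoop  soft  nofile  32768
>   hadoop  hard  nofile  32768
>
>   <!-- conf/hdfs-site.xml on every datanode, then restart the datanodes -->
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>2048</value>
>   </property>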
>
> As for recovering from crashes, I haven't had much luck.
> I'm only running a 3-server cluster so that may be an issue,
> but when one server goes down, it doesn't seem to be too easy
> to recover the HBase table data after getting everything restarted again.
> I've usually had to wipe HDFS and start from scratch.
>
> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
> <bmdevelopm...@gmail.com> wrote:
>> Hi, Thanks for the suggestions. I'll make note of this.
>> (I've decided to reinsert; given time constraints it is probably
>> quicker than trying to debug and recover.)
>> So, I guess I am more concerned about trying to prevent this from
>> happening again.
>> Is it possible that a shell count caused enough load to crash HBase?
>> Or that nodes becoming unavailable due to heavy network load could
>> cause data corruption?
>>
>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>> <michael_se...@hotmail.com> wrote:
>>>
>>> Try this...
>>>
>>> 1 run hadoop fsck /
>>> 2 shut down HBase
>>> 3 mv /hbase to /hbase.old
>>> 4 restart HBase (optional, just for a sanity check)
>>> 5 copy /hbase.old back to /hbase
>>> 6 restart HBase
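>>>
>>> Roughly, as commands (just a sketch - it assumes hbase.rootdir is /hbase
>>> and that the hadoop/hbase bin scripts are on your PATH; adjust for your
>>> install):
>>>
>>>   hadoop fsck /                      # 1. check HDFS health first
>>>   stop-hbase.sh                      # 2. shut down HBase
>>>   hadoop fs -mv /hbase /hbase.old    # 3. move the root dir aside
>>>   # (step 4, the optional sanity restart, is skipped here; if you do it,
>>>   #  remove the fresh /hbase it creates before copying the data back)
>>>   hadoop fs -cp /hbase.old /hbase    # 5. copy the data back
>>>   start-hbase.sh                     # 6. restart HBase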
>>>
>>> This may not help, but it can't hurt.
>>> Depending on the size of your HBase database, it could take a while. On our
>>> sandbox, we suffer from ZooKeeper and HBase failures when there's a heavy
>>> load on the network. (Don't ask, the sandbox was just a play area on
>>> whatever hardware we could find.) Doing this copy cleaned up a database
>>> that wouldn't fully come up. It may do the same for you.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>> Subject: Re: hbase shell count crashes
>>>> From: bmdevelopm...@gmail.com
>>>> To: hbase-user@hadoop.apache.org
>>>>
>>>> Hi,
>>>> So after a few more attempts and crashes from trying the shell count,
>>>> I ran the MR rowcounter and noticed that the number of rows was less
>>>> than it should have been - even on smaller test tables.
>>>> This led me to start looking through the logs and perform a few
>>>> compacts on .META. and restarts of HBase. Unfortunately, two tables are
>>>> now entirely missing - they no longer show up under the shell's list command.
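>>>>
>>>> For anyone following along, the MR rowcounter is typically launched
>>>> along these lines - just a sketch, since the jar name depends on your
>>>> install and running it with no arguments prints the usage for your version:
>>>>
>>>>   hadoop jar $HBASE_HOME/hbase-0.20.2.jar rowcounter 'table'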
>>>>
>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>> lot of this in the master log.
>>>>
>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>> info:regioninfo is empty for row:
>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>> info:serverstartcode
>>>>
>>>> Came across this in the regionserver log:
>>>> 2010-02-16 23:58:33,851 WARN
>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>> because its empty. HBASE-646 DATA LOSS?
>>>>
>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>> re-insert from scratch as this is still in the testing phase,
>>>> but I would be curious to find out what has led to these issues in order
>>>> to possibly fix them, or at least not repeat them.
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>> <bmdevelopm...@gmail.com> wrote:
>>>> > Hi, Thanks for the explanation.
>>>> >
>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>>> >
>>>> > I have never come across this before, but this is the first time I've
>>>> > had 7M rows in the db.
>>>> > Is there anything going on that would bog down the network and cause
>>>> > this file to be unreachable?
>>>> >
>>>> > I have 3 servers. The master is running the jobtracker, namenode and 
>>>> > hmaster.
>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>> >
>>>> > Appreciate the help.
>>>> >
>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans 
>>>> > <jdcry...@apache.org> wrote:
>>>> >> This line
>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>> >> blk_-6288142015045035704_88516
>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>> >>
>>>> >> means that the region server wasn't able to fetch a block for the .META.
>>>> >> table (the table where all region addresses are). Are you able to open
>>>> >> that file using the bin/hadoop command line utility?
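>>>> >>
>>>> >> For example, something along these lines (just a sketch using the path
>>>> >> from the error above):
>>>> >>
>>>> >>   bin/hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > /dev/null
>>>> >>   bin/hadoop fsck /hbase/.META./1028785192/info/8254845156484129698 -files -blocks -locations
>>>> >>
>>>> >> The fsck output should show whether the block is actually missing or
>>>> >> just couldn't be reached at the time.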
>>>> >>
>>>> >> J-D
>>>> >>
>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>> >> bmdevelopm...@gmail.com> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>> I'm currently trying to run a count in the HBase shell and it crashes
>>>> >>> right towards the end.
>>>> >>> This in turn seems to crash HBase, or at least causes the regionservers
>>>> >>> to become unavailable.
>>>> >>>
>>>> >>> Here's the tail end of the count output:
>>>> >>> http://pastebin.com/m465346d0
>>>> >>>
>>>> >>> I'm on version 0.20.2 and running this command:
>>>> >>> > count 'table', 1000000
>>>> >>>
>>>> >>> Anyone with similar issues or ideas on this?
>>>> >>> Please let me know if you need further info.
>>>> >>> Thanks
>>>> >>>
>>>> >>
>>>> >
>>>
>>
>
