Hi, thanks for the suggestions. I'll make a note of this.
(I've decided to reinsert, as with time constraints it is probably
quicker than trying to debug and recover.)
So, I guess I am more concerned about trying to prevent this from
happening again.
Is it possible that a shell count caused enough load to crash hbase?
Or that nodes becoming unavailable due to heavy network load could
cause data corruption?
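
(For the archives, here is the move-and-copy recovery suggested below, roughly as I ran it. This is just a sketch - the script names and paths are from our 0.20.x install with the HBase and Hadoop bin/ directories on the PATH, so adjust for your setup:)

```shell
# Sketch of the suggested recovery sequence; assumes the HBase/Hadoop
# bin scripts are on the PATH and HBase's root dir is /hbase in HDFS.
hadoop fsck /                      # 1. check HDFS health first
stop-hbase.sh                      # 2. shut down hbase
hadoop fs -mv /hbase /hbase.old    # 3. move /hbase aside
start-hbase.sh; stop-hbase.sh      # 4. optional restart with a fresh /hbase, as a sanity check
hadoop fs -cp /hbase.old/ /hbase   # 5. copy the data back
start-hbase.sh                     # 6. restart
```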

On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
<michael_se...@hotmail.com> wrote:
>
> Try this...
>
> 1 run hadoop fsck /
> 2 shut down hbase
> 3 mv /hbase to /hbase.old
> 4 restart hbase (optional, just for a sanity check)
> 5 copy /hbase.old back to /hbase
> 6 restart
>
> This may not help, but it can't hurt.
> Depending on the size of your hbase database, it could take a while. On our 
> sandbox, we suffer from zookeeper and hbase failures when there's a heavy 
> load on the network. (Don't ask, the sandbox was just a play area on whatever 
> hardware we could find.) Doing this copy cleaned up a database that wouldn't 
> fully come up. May do the same for you.
>
> HTH
>
> -Mike
>
>
>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>> Subject: Re: hbase shell count crashes
>> From: bmdevelopm...@gmail.com
>> To: hbase-user@hadoop.apache.org
>>
>> Hi,
>> So after a few more attempts and crashes from trying the shell count,
>> I ran the MR rowcounter and noticed that the number of rows was lower
>> than it should have been - even on smaller test tables.
>> This led me to start looking through the logs and perform a few
>> compacts on META and restarts of hbase. Unfortunately, two tables are
>> now entirely missing - they no longer show up in the shell's list command.
>>
>> I'm not entirely sure what to look for in the logs, but I've noticed a
>> lot of this in the master log.
>>
>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>> info:regioninfo is empty for row:
>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>> info:serverstartcode
>>
>> Came across this in the regionserver log:
>> 2010-02-16 23:58:33,851 WARN
>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>> because its empty. HBASE-646 DATA LOSS?
>>
>> Any ideas on whether the tables are recoverable? It's not a big deal
>> for me to re-insert from scratch since this is still in the testing phase,
>> but I'd be curious to find out what led to these issues so I can
>> fix them, or at least avoid repeating them.
>>
>> Thanks
>>
>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>> <bmdevelopm...@gmail.com> wrote:
>> > Hi, Thanks for the explanation.
>> >
>> > Yes, I was able to cat the file from all three of my region servers:
>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>> >
>> > I have never come across this before, but this is the first time I've
>> > had 7M rows in the db.
>> > Is there anything going on that would bog down the network and cause
>> > this file to be unreachable?
>> >
>> > I have 3 servers. The master is running the jobtracker, namenode and 
>> > hmaster.
>> > And all 3 are running datanodes, regionservers and zookeeper.
>> >
>> > Appreciate the help.
>> >
>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans <jdcry...@apache.org> 
>> > wrote:
>> >> This line
>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>> >> blk_-6288142015045035704_88516
>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>> >>
>> >> Means that the region server wasn't able to fetch a block for the .META.
>> >> table (the table where all region addresses are). Are you able to open 
>> >> that
>> >> file using the bin/hadoop command line utility?
>> >>
>> >> J-D
>> >>
>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>> >> bmdevelopm...@gmail.com> wrote:
>> >>
>> >>> Hi,
>> >>> I'm currently trying to run a count in hbase shell and it crashes
>> >>> right towards the end.
>> >>> This in turn seems to crash hbase, or at least causes the regionservers
>> >>> to become unavailable.
>> >>>
>> >>> Here's the tail end of the count output:
>> >>> http://pastebin.com/m465346d0
>> >>>
>> >>> I'm on version 0.20.2 and running this command:
>> >>> > count 'table', 1000000
>> >>>
>> >>> Anyone with similar issues or ideas on this?
>> >>> Please let me know if you need further info.
>> >>> Thanks
>> >>>
>> >>
>> >
>
