Thanks. I'll take a look at that in depth as soon as I have a chance. Seriously though, brilliant work and thanks to all involved - it's progressed a great deal even in the last 9 months I've been following and using the product. Really enjoying it.
On Wed, Mar 3, 2010 at 5:58 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Mmm, then you might be hitting http://issues.apache.org/jira/browse/HBASE-2244
>
> As you can see, we are working hard to stabilize HBase as much as possible ;)
>
> J-D
>
> On Wed, Mar 3, 2010 at 2:56 PM, Bluemetrix Development
> <bmdevelopm...@gmail.com> wrote:
>> Yes, upgrading to 0.20.3 should be added to my list above. I have
>> since done this.
>> Thanks very much for that.
>>
>> On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>>> There were a lot of problems with Hadoop pre-0.20.2 for clusters
>>> smaller than 10 nodes, especially 3-node clusters when a node fails.
>>> If you are talking about just region servers: you are using 0.20.2,
>>> and 0.20.3 has stability fixes.
>>>
>>> J-D
>>>
>>> On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
>>> <bmdevelopm...@gmail.com> wrote:
>>>> For completeness' sake, I'll update here.
>>>> The issues with shell counts and rowcounter crashing were fixed by upping
>>>> - open files to 32K (ulimit -n)
>>>> - dfs.datanode.max.xcievers to 2048
>>>> (I had overlooked this when moving to a larger cluster.)
>>>>
>>>> As for recovering from crashes, I haven't had much luck.
>>>> I'm only running a 3-server cluster, so that may be an issue,
>>>> but when one server goes down, it doesn't seem to be too easy
>>>> to recover the HBase table data after getting everything restarted again.
>>>> I've usually had to wipe HDFS and start from scratch.
>>>>
>>>> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
>>>> <bmdevelopm...@gmail.com> wrote:
>>>>> Hi, thanks for the suggestions. I'll make note of this.
>>>>> (I've decided to reinsert, as with time constraints it is probably
>>>>> quicker than trying to debug and recover.)
>>>>> So I guess I am more concerned with preventing this from
>>>>> happening again.
>>>>> Is it possible that a shell count caused enough load to crash HBase?
>>>>> Or that nodes becoming unavailable due to heavy network load could
>>>>> cause data corruption?
>>>>>
>>>>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>>>>> <michael_se...@hotmail.com> wrote:
>>>>>>
>>>>>> Try this...
>>>>>>
>>>>>> 1. Run hadoop fsck /
>>>>>> 2. Shut down HBase
>>>>>> 3. mv /hbase to /hbase.old
>>>>>> 4. Restart HBase (optional, just as a sanity check)
>>>>>> 5. Copy /hbase.old back to /hbase
>>>>>> 6. Restart
>>>>>>
>>>>>> This may not help, but it can't hurt.
>>>>>> Depending on the size of your HBase database, it could take a while. On
>>>>>> our sandbox, we suffer from ZooKeeper and HBase failures when there's a
>>>>>> heavy load on the network. (Don't ask, the sandbox was just a play area
>>>>>> on whatever hardware we could find.) Doing this copy cleaned up a
>>>>>> database that wouldn't fully come up. It may do the same for you.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>
>>>>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>>>>> Subject: Re: hbase shell count crashes
>>>>>>> From: bmdevelopm...@gmail.com
>>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>>
>>>>>>> Hi,
>>>>>>> So after a few more attempts and crashes from trying the shell count,
>>>>>>> I ran the MR rowcounter and noticed that the number of rows was less
>>>>>>> than it should have been - even on smaller test tables.
>>>>>>> This led me to start looking through the logs and perform a few
>>>>>>> compactions on .META. and restarts of HBase. Unfortunately, two tables
>>>>>>> are now entirely missing - they no longer show up under the shell's
>>>>>>> list command.
>>>>>>>
>>>>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>>>>> lot of this in the master log:
>>>>>>>
>>>>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>>>>> info:regioninfo is empty for row:
>>>>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>>>>> info:serverstartcode
>>>>>>>
>>>>>>> I came across this in the regionserver log:
>>>>>>>
>>>>>>> 2010-02-16 23:58:33,851 WARN
>>>>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>>>>> because its empty. HBASE-646 DATA LOSS?
>>>>>>>
>>>>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>>>>> re-insert from scratch, as this is still in the testing phase,
>>>>>>> but I would be curious to find out what led to these issues in order
>>>>>>> to fix them, or at least not repeat them.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>>>>> <bmdevelopm...@gmail.com> wrote:
>>>>>>> > Hi, thanks for the explanation.
>>>>>>> >
>>>>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 >
>>>>>>> > tmp.out
>>>>>>> >
>>>>>>> > I have never come across this before, but this is the first time I've
>>>>>>> > had 7M rows in the db.
>>>>>>> > Is there anything going on that would bog down the network and cause
>>>>>>> > this file to be unreachable?
>>>>>>> >
>>>>>>> > I have 3 servers. The master is running the jobtracker, namenode and
>>>>>>> > hmaster.
>>>>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>>>>> >
>>>>>>> > Appreciate the help.
>>>>>>> >
>>>>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans
>>>>>>> > <jdcry...@apache.org> wrote:
>>>>>>> >> This line:
>>>>>>> >>
>>>>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>>>>> >> blk_-6288142015045035704_88516
>>>>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>>>>> >>
>>>>>>> >> means that the region server wasn't able to fetch a block for the
>>>>>>> >> .META. table (the table where all region addresses are). Are you
>>>>>>> >> able to open that file using the bin/hadoop command line utility?
>>>>>>> >>
>>>>>>> >> J-D
>>>>>>> >>
>>>>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>>>>> >> bmdevelopm...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >>> Hi,
>>>>>>> >>> I'm currently trying to run a count in the hbase shell and it
>>>>>>> >>> crashes right towards the end.
>>>>>>> >>> This in turn seems to crash HBase, or at least causes the
>>>>>>> >>> regionservers to become unavailable.
>>>>>>> >>>
>>>>>>> >>> Here's the tail end of the count output:
>>>>>>> >>> http://pastebin.com/m465346d0
>>>>>>> >>>
>>>>>>> >>> I'm on version 0.20.2 and running this command:
>>>>>>> >>> > count 'table', 1000000
>>>>>>> >>>
>>>>>>> >>> Anyone with similar issues or ideas on this?
>>>>>>> >>> Please let me know if you need further info.
>>>>>>> >>> Thanks
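
For reference, the two settings the thread converged on (32K open file descriptors and dfs.datanode.max.xcievers at 2048) look roughly like this on a Hadoop/HBase 0.20 cluster. This is only a sketch: the "hadoop" user name and file locations are assumptions about a typical Linux install, so adjust them for your own nodes.

  # /etc/security/limits.conf on every node, assuming the Hadoop/HBase
  # daemons run as the "hadoop" user; log out and back in, then verify
  # with "ulimit -n"
  hadoop  -  nofile  32768

  # hdfs-site.xml on every datanode (note the property's historical
  # misspelling "xcievers"); restart the datanodes afterwards
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>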
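
Michael's copy-aside recovery from upthread, spelled out as commands. This is a sketch that assumes hbase.rootdir points at /hbase in HDFS and that the stock bin/stop-hbase.sh and bin/start-hbase.sh scripts are used; check hbase-site.xml before moving anything, and expect the copy to take a while on a large table.

  hadoop fsck /                     # 1. make sure HDFS itself is healthy
  bin/stop-hbase.sh                 # 2. shut down HBase
  hadoop fs -mv /hbase /hbase.old   # 3. set the old root directory aside
  bin/start-hbase.sh                # 4. optional sanity check with an empty rootdir
  bin/stop-hbase.sh                 #    (stop again before restoring)
  hadoop fs -cp /hbase.old /hbase   # 5. copy the data back
  bin/start-hbase.sh                # 6. restart and let the master rescan the regions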
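
And for cross-checking row counts without going through the shell, the MapReduce RowCounter mentioned above can be run from the HBase jar. The jar name below assumes 0.20.2 and <tablename> is a placeholder; the HBase config and jars also need to be visible to the MapReduce job (e.g. via HADOOP_CLASSPATH).

  hadoop jar $HBASE_HOME/hbase-0.20.2.jar rowcounter <tablename>
  # or, equivalently, through the hbase launcher script:
  bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>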