Re: Story of my HBase Bugs / Feature Suggestions

Jonathan Gray Tue, 25 Aug 2009 19:57:59 -0700

That is excellent news, Bradford!  Thanks for sticking with us :)

JG


On Tue, August 25, 2009 7:14 pm, Bradford Stephens wrote:
> As a side note, we've been beating on RC2 for a week solid, and it's very
>  stable. We're really only limited by our RAM and GC, now :)
>
> On Sat, Aug 22, 2009 at 6:59 AM, Andrew Purtell <[email protected]>
> wrote:
>
>
>> Jon,
>>
>>
>> Cool. I suspected as much. I'm really glad to see those bugs were found
>> and fixed...
>>
>> - Andy
>>
>>
>>
>>
>>
>> ________________________________
>> From: Jonathan Gray <[email protected]>
>> To: [email protected]
>> Sent: Saturday, August 22, 2009 12:24:51 AM
>> Subject: Re: Story of my HBase Bugs / Feature Suggestions
>>
>>
>> Andy,
>>
>>
>> Bradford ran his imports when there was both a Scanner bug related to
>> snapshotting that opened up a race condition, as well as the nasty bugs
>> in getClosestBefore used to look things up in META.
>>
>> It was most likely a combination of both of these things making for
>> some rather nasty behavior.
>>
>> JG
>>
>>
>> Andrew Purtell wrote:
>>
>>> There are plans to host live region assignments in ZK and keep only
>>> an up-to-date copy of this state in META for use on cold boot. This is
>>> on the roadmap for 0.21 but perhaps could be considered for 0.20.1 also.
>>> This may help here.
>>>
>>> A TM development group saw the same behavior on a 0.19 cluster. We
>>> postponed looking at this because 0.20 has a significant rewrite of
>>> region assignment. However, it is interesting to hear such a similar
>>> description. I worry the underlying cause may be scanners getting
>>> stale data on the RS as opposed to some master problem which could be
>>> solved by the above, a more pervasive problem. Bradford, any chance you
>>> kept around logs or similar which may provide clues?
>>>
>>> - Andy
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Bradford Stephens <[email protected]>
>>> To: [email protected]
>>> Sent: Friday, August 21, 2009 6:48:17 AM
>>> Subject: Story of my HBase Bugs / Feature Suggestions
>>>
>>>
>>> Hey there,
>>>
>>>
>>> I'm sending out this summary of how I diagnosed what was wrong with
>>> my cluster in hopes that you can glean some knowledge/suggestions from
>>> it :) Thanks for the diagnostic footwork.
>>>
>>>
>>> A few days ago, I noticed that simple MR jobs I was running against
>>> .20-RC2 were failing. Scanners were reaching the end of a region, and
>>> then simply freezing. The only indication I had of this was the Mapper
>>> timing out after 1000 seconds -- there were no error messages in the
>>> logs for either Hadoop or HBase.
>>>
>>> It turns out that my table was corrupt:
>>>
>>>
>>> 1. Doing a 'GET' from the shell on a row near the end of a region
>>> resulted in an error: "Row not in expected region", or something to
>>> that effect. It re-appeared several times, and I never got the row
>>> content.
>>> 2. What the Master UI indicated for the region distribution was totally
>>> different from what the RS reported. Row key ranges were on different
>>> servers than the UI knew about, and the nodes reported different start
>>> and end keys for a region than the UI.
>>>
>>> I'm not sure how this arose: I noticed after a heavy insert job that
>>> when we tried to shut down our cluster, it took 30 dots and more -- so
>>> we manually killed master. Would that lead to corruption?
>>>
>>> I finally resolved the problem by dropping the table and re-loading
>>> the data
>>>
>>> A few suggestions going forward:
>>> 1. More useful scanner error messages: GET reported that there was a
>>> problem finding a certain row, why couldn't Scanner? There wasn't even
>>> a timeout or anything -- it just sat there.
>>> 2. A fsck / restore would be useful for HBase. I imagine you can
>>> recreate .META. using .regioninfo and scanning blocks out of HDFS. This
>>> would play nice with the HBase bulk loader story, I suppose.
>>>
>>> I'll be happy to work on these in my spare time, if I ever get any ;)
>>>
>>>
>>> Cheers,
>>> Bradford
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
> and Computer Science
>
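
For anyone reading this thread later: the symptoms described above (a "Row not in expected region" error, and start/end keys that disagree between the Master UI and the region servers) are the sort of thing an fsck-style check would look for. Below is a minimal sketch of such a check in Python, not HBase code. It assumes a region is just a start key (inclusive), an end key (exclusive, with an empty key meaning the table boundary) and a server name. Region, find_region and check_chain are hypothetical names made up for illustration. The lookup is only loosely analogous to what a closest-row-at-or-before lookup against META is used for.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical stand-in for a region record; not an HBase type."""
    start: bytes   # inclusive; b"" means start of table
    end: bytes     # exclusive; b"" means end of table
    server: str


def find_region(regions: List[Region], row: bytes) -> Optional[Region]:
    """Return the region whose key range contains `row`, or None.

    `regions` must be sorted by start key. The candidate is the region
    with the greatest start key that is <= row.
    """
    starts = [r.start for r in regions]
    i = bisect_right(starts, row) - 1
    if i < 0:
        return None
    candidate = regions[i]
    if candidate.end != b"" and row >= candidate.end:
        return None  # row falls into a hole after this region
    return candidate


def check_chain(regions: List[Region]) -> List[str]:
    """Report holes and overlaps in a table's region key ranges."""
    ordered = sorted(regions, key=lambda r: r.start)
    if not ordered:
        return ["no regions"]
    problems = []
    if ordered[0].start != b"":
        problems.append("first region does not begin at table start")
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.end == b"" or prev.end > cur.start:
            problems.append(f"overlap between {prev.start!r} and {cur.start!r}")
        elif prev.end < cur.start:
            problems.append(f"hole between {prev.end!r} and {cur.start!r}")
    if ordered[-1].end != b"":
        problems.append("last region does not reach table end")
    return problems
```

Running check_chain once on the region list as the master sees it and once on the list the region servers report, then comparing the two, would surface the kind of disagreement described in the original message. A real tool would have to read the region metadata from the cluster itself, which this sketch does not attempt.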
