Sure, I've got a ton of logs. I'll try to grab the most pertinent ones and
put them on rapidshare, but there will still be a lot of data to sift
through :)

On Thu, Aug 20, 2009 at 8:57 PM, Andrew Purtell <[email protected]> wrote:

> There are plans to host live region assignments in ZK and keep only an
> up-to-date copy of this state in META for use on cold boot. This is on the
> roadmap for 0.21 but perhaps could be considered for 0.20.1 also. This may
> help here.
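>
> To make the idea concrete, a client-side read of ZK-hosted assignments
> might look something like the sketch below. The znode layout is invented
> purely for illustration; nothing like it exists today:
>
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class AssignmentPeek {
>   public static void main(String[] args) throws Exception {
>     // Hypothetical layout: one child znode per region, with the hosting
>     // region server's address stored as the znode data.
>     ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
>       public void process(WatchedEvent event) { /* no watches needed */ }
>     });
>     String base = "/hbase/assignments";   // assumed path, not a real znode
>     for (String region : zk.getChildren(base, false)) {
>       byte[] server = zk.getData(base + "/" + region, false, null);
>       System.out.println(region + " -> " + new String(server));
>     }
>     zk.close();
>   }
> }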
>
> A TM development group saw the same behavior on a 0.19 cluster. We
> postponed looking into it because 0.20 includes a significant rewrite of
> region assignment. However, it is interesting to hear such a similar
> description. I worry the underlying cause may not be a master problem that
> the above would solve, but rather scanners getting stale data on the RS,
> which would be a more pervasive problem. Bradford, any chance you kept
> logs or anything similar that might provide clues?
>
>   - Andy
>
>
>
>
> ________________________________
> From: Bradford Stephens <[email protected]>
> To: [email protected]
> Sent: Friday, August 21, 2009 6:48:17 AM
> Subject: Story of my HBase Bugs / Feature Suggestions
>
> Hey there,
>
> I'm sending out this summary of how I diagnosed what was wrong with my
> cluster in hopes that you can glean some knowledge/suggestions from it :)
> Thanks for the diagnostic footwork.
>
> A few days ago, I noticed that simple MR jobs I was running against
> 0.20-RC2 were failing. Scanners were reaching the end of a region and then
> simply freezing. The only indication I had of this was the Mapper timing
> out after 1000 seconds -- there were no error messages in the logs for
> either Hadoop or HBase.
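>
> For context, the jobs were plain table scans, roughly the shape of the
> sketch below (the table name and the row-counting mapper are placeholders,
> not my actual code):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> import org.apache.hadoop.hbase.mapreduce.TableMapper;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>
> public class SimpleScanJob {
>   // Stand-in mapper: just counts the rows it sees.
>   public static class RowCounter
>       extends TableMapper<NullWritable, NullWritable> {
>     protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
>       ctx.getCounter("scan", "rows").increment(1);
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     Job job = new Job(new HBaseConfiguration(), "simple-scan");
>     job.setJarByClass(SimpleScanJob.class);
>     TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
>         RowCounter.class, NullWritable.class, NullWritable.class, job);
>     job.setOutputFormatClass(NullOutputFormat.class);
>     job.setNumReduceTasks(0);
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }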
>
> It turns out that my table was corrupt:
>
> 1. Doing a GET from the shell on a row near the end of a region resulted
> in an error along the lines of "Row not in expected region". The error
> re-appeared on every retry, and I never got the row content (a Java-client
> version of the same check is sketched after this list).
> 2. The region distribution shown by the Master UI was completely different
> from what the region servers reported: row key ranges were hosted on
> different servers than the UI showed, and the nodes reported different
> start and end keys for a region than the UI did.
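>
> For anyone who wants to reproduce that check from the Java client rather
> than the shell, it would look roughly like this (table name and row key
> are placeholders):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ProbeRow {
>   public static void main(String[] args) throws Exception {
>     HTable table = new HTable(new HBaseConfiguration(), "mytable");
>     // On my corrupt table, this get surfaced the "row not in expected
>     // region"-style error instead of returning the row.
>     Result r = table.get(new Get(Bytes.toBytes("row-near-region-end")));
>     System.out.println(r.isEmpty() ? "no such row" : r.toString());
>   }
> }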
>
> I'm not sure how this arose. I did notice that, after a heavy insert job,
> when we tried to shut down our cluster the shutdown script printed 30-plus
> dots without finishing, so we killed the master manually. Could that lead
> to corruption?
>
> I finally resolved the problem by dropping the table and re-loading the
> data (the drop was just the usual disable/delete/recreate; a rough sketch
> follows).
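>
> Via HBaseAdmin, with placeholder table and family names, that step is
> roughly:
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HColumnDescriptor;
> import org.apache.hadoop.hbase.HTableDescriptor;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
>
> public class RebuildTable {
>   public static void main(String[] args) throws Exception {
>     HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
>     // Drop the corrupt table...
>     admin.disableTable("mytable");
>     admin.deleteTable("mytable");
>     // ...and recreate it empty before re-running the load job.
>     HTableDescriptor desc = new HTableDescriptor("mytable");
>     desc.addFamily(new HColumnDescriptor("cf"));
>     admin.createTable(desc);
>   }
> }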
>
> A few suggestions going forward:
> 1. More useful scanner error messages: GET reported that there was a
> problem finding a certain row, so why couldn't the Scanner? There wasn't
> even a timeout or anything -- it just sat there.
> 2. An fsck / restore tool would be useful for HBase. I imagine you could
> recreate .META. from the .regioninfo files and region directories already
> sitting in HDFS (a rough sketch of that scan follows this list). This
> would play nicely with the HBase bulk loader story, I suppose.
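>
> A rough sketch of the .regioninfo scan mentioned in (2), assuming the
> default /hbase root directory and a placeholder table name (a real fsck
> would obviously have to do much more than print region names):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HRegionInfo;
>
> public class RegionInfoScan {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     // Layout assumed here: <hbase.rootdir>/<table>/<region>/.regioninfo
>     Path tableDir = new Path("/hbase/mytable");
>     for (FileStatus region : fs.listStatus(tableDir)) {
>       Path riPath = new Path(region.getPath(), ".regioninfo");
>       if (!fs.exists(riPath)) continue;   // skip non-region files/dirs
>       HRegionInfo info = new HRegionInfo();
>       FSDataInputStream in = fs.open(riPath);
>       info.readFields(in);                // HRegionInfo is a Writable
>       in.close();
>       System.out.println(info.getRegionNameAsString());
>       // A repair tool could re-insert these HRegionInfos into .META. here.
>     }
>   }
> }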
>
> I'll be happy to work on these in my spare time, if I ever get any ;)
>
> Cheers,
> Bradford
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
> and Computer Science
>



-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science
