Jon,

Cool. I suspected as much. I'm really glad to see those bugs were found and 
fixed... 

   - Andy




________________________________
From: Jonathan Gray <[email protected]>
To: [email protected]
Sent: Saturday, August 22, 2009 12:24:51 AM
Subject: Re: Story of my HBase Bugs / Feature Suggestions

Andy,

Bradford ran his imports when there were both a Scanner bug related to 
snapshotting that opened up a race condition and the nasty bugs in 
getClosestBefore used to look things up in META.
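
(For context: the META lookup is conceptually a "closest row at or before" 
search over the region start keys. A minimal sketch of that semantics, using 
plain java.util.TreeMap rather than the actual HBase code:)

  import java.util.Map;
  import java.util.TreeMap;

  public class MetaLookupSketch {
      public static void main(String[] args) {
          // Start key of each region -> hosting server, mimicking rows in .META.
          TreeMap<String, String> meta = new TreeMap<String, String>();
          meta.put("", "rs1");       // the first region starts at the empty key
          meta.put("row-m", "rs2");
          meta.put("row-t", "rs3");

          // Locating "row-q" means finding the greatest start key <= "row-q",
          // i.e. the "closest row before" semantics the buggy lookup implements.
          Map.Entry<String, String> region = meta.floorEntry("row-q");
          System.out.println(region.getKey() + " -> " + region.getValue());
          // prints: row-m -> rs2
      }
  }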

It was most likely a combination of these two things making for some rather 
nasty behavior.

JG

Andrew Purtell wrote:
> There are plans to host live region assignments in ZK and keep only an 
> up-to-date copy of this state in META for use on cold boot. This is on the 
> roadmap for 0.21 but perhaps could be considered for 0.20.1 also. This may 
> help here. 
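> 
> To make the idea concrete (a sketch only, not actual HBase code; the znode 
> layout here is made up), live assignment in ZK could be one ephemeral znode 
> per region:
> 
>   import org.apache.zookeeper.CreateMode;
>   import org.apache.zookeeper.WatchedEvent;
>   import org.apache.zookeeper.Watcher;
>   import org.apache.zookeeper.ZooDefs;
>   import org.apache.zookeeper.ZooKeeper;
> 
>   public class AssignmentSketch {
>       public static void main(String[] args) throws Exception {
>           ZooKeeper zk = new ZooKeeper("localhost:2181", 30000,
>                   new Watcher() { public void process(WatchedEvent e) {} });
>           // Hypothetical layout: /assignment/<region> names the serving RS
>           // (assumes the /assignment parent znode already exists).
>           // EPHEMERAL means the entry disappears with the writer's session,
>           // so the live state cannot go stale the way a META row can.
>           zk.create("/assignment/region-row-m", "rs2".getBytes(),
>                   ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
>           zk.close();
>       }
>   }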
> A TM development group saw the same behavior on a 0.19 cluster. We
> postponed looking at this because 0.20 has a significant rewrite of
> region assignment. However, it is interesting to hear such a similar
> description. I worry the underlying cause may be scanners getting stale data 
> on the RS, a more pervasive problem, rather than some master problem that the 
> change above would solve. Bradford, any chance you kept around logs or 
> anything similar that may provide clues?
> 
>    - Andy
> 
> 
> 
> 
> ________________________________
> From: Bradford Stephens <[email protected]>
> To: [email protected]
> Sent: Friday, August 21, 2009 6:48:17 AM
> Subject: Story of my HBase Bugs / Feature Suggestions
> 
> Hey there,
> 
> I'm sending out this summary of how I diagnosed what was wrong with my
> cluster in hopes that you can glean some knowledge/suggestions from it :)
> Thanks for the diagnostic footwork.
> 
> A few days ago, I noticed that simple MR jobs I was running against 0.20-RC2
> were failing. Scanners were reaching the end of a region, and then simply
> freezing. The only indication I had of this was the Mapper timing out after
> 1000 seconds -- there were no error messages in the logs for either Hadoop
> or HBase.
> 
> It turns out that my table was corrupt:
> 
> 1. Doing a 'GET' from the shell on a row near the end of a region resulted
> in an error: "Row not in expected region", or something to that effect. It
> re-appeared several times, and I never got the row content.
> 2. The region distribution the Master UI indicated was totally different
> from what the RSs reported. Row key ranges lived on different servers than
> the UI knew about, and the region servers reported different start and end
> keys for a region than the UI did.
> 
> I'm not sure how this arose: I noticed after a heavy insert job that when we
> tried to shut down our cluster, it took 30 dots and more -- so we manually
> killed the master. Would that lead to corruption?
> 
> I finally resolved the problem by dropping the table and re-loading the data.
> 
> A few suggestions going forward:
> 1. More useful scanner error messages: GET reported that there was a problem
> finding a certain row, so why couldn't the Scanner? There wasn't even a
> timeout or anything -- it just sat there.
> 2. An fsck / restore tool would be useful for HBase. I imagine you can
> recreate .META. using the .regioninfo files and scanning blocks out of HDFS
> (a rough sketch follows below). This would play nice with the HBase bulk
> loader story, I suppose.
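> 
> A rough sketch of the recovery-scan half of that (hypothetical paths;
> deserializing each HRegionInfo and rewriting .META. is elided):
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
> 
>   public class MetaFsckSketch {
>       public static void main(String[] args) throws Exception {
>           FileSystem fs = FileSystem.get(new Configuration());
>           // Assumed layout: /hbase/<table>/<region>/.regioninfo
>           FileStatus[] infos =
>                   fs.globStatus(new Path("/hbase/*/*/.regioninfo"));
>           for (FileStatus info : infos) {
>               // Each file holds a serialized HRegionInfo; deserializing it
>               // and re-inserting its row into .META. would go here.
>               System.out.println("found region descriptor: " + info.getPath());
>           }
>       }
>   }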
> 
> I'll be happy to work on these in my spare time, if I ever get any ;)
> 
> Cheers,
> Bradford
> 
> 



      
