Jon,

Cool. I suspected as much. I'm really glad to see those bugs were found and fixed...
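
Re Bradford's fsck suggestion (#2 below): a rough sketch of what the first half of
such a tool might look like -- walk the table directory in HDFS, deserialize each
region's .regioninfo, and report what would go back into .META. All class and
method usage here is my best guess at the current client API and is untested, so
treat it as a sketch of the approach, not working code:

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaRebuildSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] is a table directory under the HBase root, e.g. /hbase/mytable
    for (FileStatus regionDir : fs.listStatus(new Path(args[0]))) {
      Path riPath = new Path(regionDir.getPath(), ".regioninfo");
      if (!fs.exists(riPath)) continue;   // skip entries that aren't regions
      FSDataInputStream in = fs.open(riPath);
      HRegionInfo hri = new HRegionInfo();
      hri.readFields(in);                 // HRegionInfo is a Writable
      in.close();
      // Dry run only: a real tool would Put the reserialized HRegionInfo
      // into info:regioninfo of .META. here, after reconciling conflicts.
      System.out.println("found region " + hri.getRegionNameAsString() +
          " start=" + Bytes.toStringBinary(hri.getStartKey()) +
          " end=" + Bytes.toStringBinary(hri.getEndKey()));
    }
  }
}

The hard part is everything this skips: regions missing a .regioninfo, overlapping
key ranges, and rows already in .META. that disagree with what's on disk.
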
- Andy

________________________________
From: Jonathan Gray <[email protected]>
To: [email protected]
Sent: Saturday, August 22, 2009 12:24:51 AM
Subject: Re: Story of my HBase Bugs / Feature Suggestions

Andy,

Bradford ran his imports when there were both a Scanner bug related to
snapshotting that opened up a race condition and some nasty bugs in
getClosestBefore, which is used to look things up in META. It was most likely
the combination of the two that produced the behavior he saw.

JG

Andrew Purtell wrote:
> There are plans to host live region assignments in ZK and keep only an
> up-to-date copy of this state in META for use on cold boot. This is on the
> roadmap for 0.21 but could perhaps be considered for 0.20.1 as well. It may
> help here.
>
> A TM development group saw the same behavior on a 0.19 cluster. We
> postponed looking into it because 0.20 has a significant rewrite of region
> assignment. However, it is interesting to hear such a similar description.
> I worry the underlying cause may not be a master problem that the above
> would solve, but rather scanners getting stale data on the RS, a more
> pervasive problem. Bradford, any chance you kept logs or anything similar
> that might provide clues?
>
> - Andy
>
> ________________________________
> From: Bradford Stephens <[email protected]>
> To: [email protected]
> Sent: Friday, August 21, 2009 6:48:17 AM
> Subject: Story of my HBase Bugs / Feature Suggestions
>
> Hey there,
>
> I'm sending out this summary of how I diagnosed what was wrong with my
> cluster in hopes that you can glean some knowledge/suggestions from it :)
> Thanks for the diagnostic footwork.
>
> A few days ago, I noticed that simple MR jobs I was running against
> 0.20-RC2 were failing. Scanners were reaching the end of a region and then
> simply freezing. The only indication I had of this was the Mapper timing
> out after 1000 seconds -- there were no error messages in the logs for
> either Hadoop or HBase.
>
> It turns out that my table was corrupt:
>
> 1. Doing a GET from the shell on a row near the end of a region resulted
> in an error: "Row not in expected region", or something to that effect.
> The error reappeared several times, and I never got the row content.
> 2. The region distribution the Master UI showed was totally different from
> what the region servers reported. Row key ranges were on different servers
> than the UI knew about, and the nodes reported different start and end
> keys for a region than the UI did.
>
> I'm not sure how this arose. After a heavy insert job, I noticed that when
> we tried to shut down our cluster, it took 30 dots and more -- so we
> manually killed the master. Would that lead to corruption?
>
> I finally resolved the problem by dropping the table and re-loading the
> data.
>
> A few suggestions going forward:
> 1. More useful scanner error messages: GET reported that there was a
> problem finding a certain row; why couldn't the Scanner? There wasn't even
> a timeout or anything -- it just sat there.
> 2. An fsck / restore tool would be useful for HBase. I imagine you can
> recreate .META. using .regioninfo and scanning blocks out of HDFS. This
> would play nicely with the HBase bulk loader story, I suppose.
>
> I'll be happy to work on these in my spare time, if I ever get any ;)
>
> Cheers,
> Bradford
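
P.S. If anyone wants to reproduce Bradford's check #1 programmatically rather
than from the shell, something like the below against the 0.20 client API
should surface the same complaint as a client-side exception. The table name
and row key are made up, and I haven't run this:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetProbe {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    // Probe a row near the end of a suspect region; against a corrupt META
    // the client should throw a "row not in expected region"-type error
    // rather than return.
    Result r = table.get(new Get(Bytes.toBytes("row-near-region-end")));
    System.out.println(r.isEmpty() ? "no row" : r.toString());
  }
}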
