As a side note, we've been beating on RC2 for a week solid, and it's very stable. We're really only limited by our RAM and GC now :)
On Sat, Aug 22, 2009 at 6:59 AM, Andrew Purtell <[email protected]> wrote:

> Jon,
>
> Cool. I suspected as much. I'm really glad to see those bugs were found and fixed...
>
> - Andy
>
>
> ________________________________
> From: Jonathan Gray <[email protected]>
> To: [email protected]
> Sent: Saturday, August 22, 2009 12:24:51 AM
> Subject: Re: Story of my HBase Bugs / Feature Suggestions
>
> Andy,
>
> Bradford ran his imports when there was both a Scanner bug related to snapshotting that opened up a race condition, as well as the nasty bugs in getClosestBefore used to look things up in META.
>
> It was most likely a combination of both of these things making for some rather nasty behavior.
>
> JG
>
> Andrew Purtell wrote:
> > There are plans to host live region assignments in ZK and keep only an up-to-date copy of this state in META for use on cold boot. This is on the roadmap for 0.21 but perhaps could be considered for 0.20.1 also. This may help here.
> >
> > A TM development group saw the same behavior on a 0.19 cluster. We postponed looking at this because 0.20 has a significant rewrite of region assignment. However, it is interesting to hear such a similar description. I worry the underlying cause may be scanners getting stale data on the RS (a more pervasive problem), as opposed to some master problem which could be solved by the above. Bradford, any chance you kept around logs or similar which may provide clues?
> >
> > - Andy
> >
> >
> > ________________________________
> > From: Bradford Stephens <[email protected]>
> > To: [email protected]
> > Sent: Friday, August 21, 2009 6:48:17 AM
> > Subject: Story of my HBase Bugs / Feature Suggestions
> >
> > Hey there,
> >
> > I'm sending out this summary of how I diagnosed what was wrong with my cluster in hopes that you can glean some knowledge/suggestions from it :)
> >
> > Thanks for the diagnostic footwork.
> >
> > A few days ago, I noticed that simple MR jobs I was running against 0.20-RC2 were failing. Scanners were reaching the end of a region, and then simply freezing. The only indication I had of this was the Mapper timing out after 1000 seconds -- there were no error messages in the logs for either Hadoop or HBase.
> >
> > It turns out that my table was corrupt:
> >
> > 1. Doing a 'GET' from the shell on a row near the end of a region resulted in an error: "Row not in expected region", or something to that effect. It re-appeared several times, and I never got the row content.
> > 2. What the Master UI indicated for the region distribution was totally different from what the RS reported. Row key ranges were on different servers than the UI knew about, and the nodes reported different start and end keys for a region than the UI.
> >
> > I'm not sure how this arose: I noticed after a heavy insert job that when we tried to shut down our cluster, it took 30 dots and more -- so we manually killed the master. Would that lead to corruption?
> >
> > I finally resolved the problem by dropping the table and re-loading the data.
> >
> > A few suggestions going forward:
> > 1. More useful scanner error messages: GET reported that there was a problem finding a certain row, why couldn't Scanner? There wasn't even a timeout or anything -- it just sat there.
> > 2. A fsck / restore tool would be useful for HBase. I imagine you can recreate .META. using .regioninfo and scanning blocks out of HDFS. This would play nice with the HBase bulk loader story, I suppose.
> >
> > I'll be happy to work on these in my spare time, if I ever get any ;)
> >
> > Cheers,
> > Bradford

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
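
For what it's worth, here is a rough, untested sketch of what the .META.-rebuild idea in suggestion 2 of the quoted thread could look like. It assumes the 0.20-era client API (HTable/Put, HRegionInfo as a Writable), the <hbase.rootdir>/<table>/<region>/.regioninfo layout, and the info:regioninfo catalog column; it deliberately ignores the info:server / info:serverstartcode assignment columns, which the master would repopulate when it reassigns the regions. The class name and command-line argument are just placeholders.

// Rough sketch only -- not a tested tool. Rebuilds .META. rows for one table
// from the .regioninfo files that each region directory keeps in HDFS.
// Assumes the 0.20-era layout <hbase.rootdir>/<table>/<region>/.regioninfo.
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class RebuildMetaSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    FileSystem fs = FileSystem.get(conf);
    Path tableDir = new Path(conf.get("hbase.rootdir"), args[0]); // table to repair
    HTable meta = new HTable(conf, ".META.");

    // Each region directory should hold a serialized HRegionInfo in ".regioninfo".
    for (FileStatus regionDir : fs.listStatus(tableDir)) {
      if (!regionDir.isDir()) continue;
      Path riPath = new Path(regionDir.getPath(), ".regioninfo");
      if (!fs.exists(riPath)) continue;

      HRegionInfo info = new HRegionInfo();
      FSDataInputStream in = fs.open(riPath);
      try {
        info.readFields(in); // HRegionInfo is a Writable
      } finally {
        in.close();
      }

      // Catalog row: row key is the region name, and the serialized HRegionInfo
      // goes under info:regioninfo. Assignment columns are left for the master.
      Put p = new Put(info.getRegionName());
      p.add(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"), Writables.getBytes(info));
      meta.put(p);
      System.out.println("restored META row for " + info.getRegionNameAsString());
    }
    meta.flushCommits();
  }
}

Obviously a real fsck would also need to cross-check what is already in .META., handle overlapping or missing regions, and so on, but something along these lines might be a starting point.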
