That is excellent news, Bradford! Thanks for sticking with us :) JG
On Tue, August 25, 2009 7:14 pm, Bradford Stephens wrote:
> As a side note, we've been beating on RC2 for a week solid, and it's very
> stable. We're really only limited by our RAM and GC, now :)
>
> On Sat, Aug 22, 2009 at 6:59 AM, Andrew Purtell <[email protected]> wrote:
>
>> Jon,
>>
>> Cool. I suspected as much. I'm really glad to see those bugs were found
>> and fixed...
>>
>> - Andy
>>
>> ________________________________
>> From: Jonathan Gray <[email protected]>
>> To: [email protected]
>> Sent: Saturday, August 22, 2009 12:24:51 AM
>> Subject: Re: Story of my HBase Bugs / Feature Suggestions
>>
>> Andy,
>>
>> Bradford ran his imports when there was both a Scanner bug related to
>> snapshotting that opened up a race condition, as well as the nasty bugs
>> in getClosestBefore used to look things up in META.
>>
>> It was most likely a combination of both of these things making for
>> some rather nasty behavior.
>>
>> JG
>>
>> Andrew Purtell wrote:
>>
>>> There are plans to host live region assignments in ZK and keep only
>>> an up-to-date copy of this state in META for use on cold boot. This is
>>> on the roadmap for 0.21 but perhaps could be considered for 0.20.1
>>> also. This may help here.
>>>
>>> A TM development group saw the same behavior on a 0.19 cluster. We
>>> postponed looking at this because 0.20 has a significant rewrite of
>>> region assignment. However, it is interesting to hear such a similar
>>> description. I worry the underlying cause may be scanners getting
>>> stale data on the RS as opposed to some master problem which could be
>>> solved by the above, a more pervasive problem. Bradford, any chance
>>> you kept around logs or similar which may provide clues?
>>>
>>> - Andy
>>>
>>> ________________________________
>>> From: Bradford Stephens <[email protected]>
>>> To: [email protected]
>>> Sent: Friday, August 21, 2009 6:48:17 AM
>>> Subject: Story of my HBase Bugs / Feature Suggestions
>>>
>>> Hey there,
>>>
>>> I'm sending out this summary of how I diagnosed what was wrong with
>>> my cluster in hopes that you can glean some knowledge/suggestions from
>>> it :) Thanks for the diagnostic footwork.
>>>
>>> A few days ago, I noticed that simple MR jobs I was running against
>>> .20-RC2 were failing. Scanners were reaching the end of a region, and
>>> then simply freezing. The only indication I had of this was the Mapper
>>> timing out after 1000 seconds -- there were no error messages in the
>>> logs for either Hadoop or HBase.
>>>
>>> It turns out that my table was corrupt:
>>>
>>> 1. Doing a 'GET' from the shell on a row near the end of a region
>>> resulted in an error: "Row not in expected region", or something to
>>> that effect. It re-appeared several times, and I never got the row
>>> content.
>>> 2. What the Master UI indicated for the region distribution was totally
>>> different from what the RS reported. Row key ranges were on different
>>> servers than the UI knew about, and the nodes reported different
>>> start and end keys for a region than the UI.
>>>
>>> I'm not sure how this arose: I noticed after a heavy insert job that
>>> when we tried to shut down our cluster, it took 30 dots and more -- so
>>> we manually killed master. Would that lead to corruption?
>>>
>>> I finally resolved the problem by dropping the table and re-loading
>>> the data.
>>>
>>> A few suggestions going forward:
>>> 1. More useful scanner error messages: GET reported that there was a
>>> problem finding a certain row, why couldn't Scanner? There wasn't even
>>> a timeout or anything -- it just sat there.
>>> 2. A fsck / restore would be useful for HBase. I imagine you can
>>> recreate .META. using .regioninfo and scanning blocks out of HDFS.
>>> This would play nice with the HBase bulk loader story, I suppose.
>>>
>>> I'll be happy to work on these in my spare time, if I ever get any ;)
>>>
>>> Cheers,
>>> Bradford
>>>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
> and Computer Science
