Sure, I've got a ton of logs. I'll try to grab what's most pertinent and put
them on RapidShare, but there will be a lot of data to sift through :)
On Thu, Aug 20, 2009 at 8:57 PM, Andrew Purtell <[email protected]> wrote:
> There are plans to host live region assignments in ZK and keep only an
> up-to-date copy of this state in META for use on cold boot. This is on the
> roadmap for 0.21 but perhaps could be considered for 0.20.1 also. This may
> help here.
>
> A TM development group saw the same behavior on a 0.19 cluster. We
> postponed looking at this because 0.20 has a significant rewrite of
> region assignment. However, it is interesting to hear such a similar
> description. I worry the underlying cause may be scanners getting stale
> data on the RS -- a more pervasive problem -- as opposed to some master
> problem which could be solved by the above. Bradford, any chance you kept
> around logs or similar which may provide clues?
>
>    - Andy
>
>
>
> ________________________________
> From: Bradford Stephens <[email protected]>
> To: [email protected]
> Sent: Friday, August 21, 2009 6:48:17 AM
> Subject: Story of my HBase Bugs / Feature Suggestions
>
> Hey there,
>
> I'm sending out this summary of how I diagnosed what was wrong with my
> cluster in hopes that you can glean some knowledge/suggestions from it :)
> Thanks for the diagnostic footwork.
>
> A few days ago, I noticed that simple MR jobs I was running against
> 0.20-RC2 were failing. Scanners were reaching the end of a region and then
> simply freezing. The only indication I had of this was the mapper timing
> out after 1000 seconds -- there were no error messages in the logs for
> either Hadoop or HBase.
>
> It turns out that my table was corrupt:
>
> 1. Doing a GET from the shell on a row near the end of a region resulted
> in an error: "Row not in expected region", or something to that effect. It
> reappeared several times, and I never got the row content.
> 2. What the Master UI indicated for the region distribution was totally
> different from what the region servers reported. Row key ranges were on
> different servers than the UI knew about, and the nodes reported different
> start and end keys for a region than the UI did.
>
> I'm not sure how this arose: I noticed after a heavy insert job that when
> we tried to shut down our cluster, it took 30 dots and more -- so we
> manually killed the master. Would that lead to corruption?
>
> I finally resolved the problem by dropping the table and re-loading the
> data.
>
> A few suggestions going forward:
> 1. More useful scanner error messages: GET reported that there was a
> problem finding a certain row, so why couldn't the scanner? There wasn't
> even a timeout or anything -- it just sat there.
> 2. An fsck / restore tool would be useful for HBase. I imagine you can
> recreate .META. using .regioninfo and scanning blocks out of HDFS. This
> would play nicely with the HBase bulk loader story, I suppose.
>
> I'll be happy to work on these in my spare time, if I ever get any ;)
>
> Cheers,
> Bradford
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
> and Computer Science
>
>
>

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science
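
As a rough illustration of the fsck / restore suggestion above -- rebuilding
.META. from the .regioninfo files that each region keeps in HDFS -- here is a
minimal, untested sketch. It assumes the 0.20-era on-disk layout
(<hbase.rootdir>/<table>/<encoded region>/.regioninfo) and only prints what it
finds; a real repair tool would cross-check the result against .META. and
rewrite the bad rows.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch only: walk <hbase.rootdir>, read each region's .regioninfo file,
 * and print the region metadata a repair tool could use to rebuild .META.
 * Assumes the 0.20-era directory layout; error handling is minimal.
 */
public class RegionInfoScanner {
  public static void main(String[] args) throws IOException {
    // 0.20-style configuration; hbase-site.xml must be on the classpath
    // so hbase.rootdir resolves.
    Configuration conf = new HBaseConfiguration();
    Path rootDir = new Path(conf.get("hbase.rootdir"));
    FileSystem fs = rootDir.getFileSystem(conf);

    // Each table is a directory directly under the root dir.
    for (FileStatus table : fs.listStatus(rootDir)) {
      if (!table.isDir()) continue;
      // Each region is a directory under the table holding a .regioninfo file.
      for (FileStatus region : fs.listStatus(table.getPath())) {
        if (!region.isDir()) continue;
        Path regionInfoFile = new Path(region.getPath(), ".regioninfo");
        if (!fs.exists(regionInfoFile)) continue;

        HRegionInfo info = new HRegionInfo();
        FSDataInputStream in = fs.open(regionInfoFile);
        try {
          info.readFields(in);  // HRegionInfo is a Writable in 0.20
        } finally {
          in.close();
        }

        // A real fsck would re-insert or correct the matching .META. row;
        // here we just report what was found on disk.
        System.out.println(info.getRegionNameAsString()
            + " start=" + Bytes.toString(info.getStartKey())
            + " end=" + Bytes.toString(info.getEndKey()));
      }
    }
  }
}

Reading .regioninfo directly sidesteps .META. entirely, which is why it seems
a reasonable basis for a repair or bulk-load story: the on-disk region
directories stay usable as a source of truth even when the catalog is corrupt.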
