OK. Let's sink the RC. It's gotten too many -1s. HBASE-1792/3 are bad too.
For the record, I'm +1 on RC2 becoming the release. It's been running here at pset on 110 nodes for the last week or so. I downloaded it, checked out its documentation, and started it up locally.

St.Ack

On Wed, Aug 26, 2009 at 9:04 AM, Jonathan Gray <[email protected]> wrote:

> I'm with Andrew. -1 on RC2.
>
> I don't see the value in putting 0.20.0 into the wild when there are known defects, just to release 0.20.1 shortly thereafter saying "this fixes important issues so please upgrade immediately." It's completely acceptable to say that lots of people are using RC2 in production and that it's fine to move forward with, and to upgrade to the release when it is available. Following the release of 0.20.0, we should all be on a PR kick: blogging, tweeting, emailing, reaching out, and talking to everyone we can about the awesome new release. So the initial release itself should be solid.
>
> The balancing issue is serious: it means that if you lose a node and it comes back online, or if you add a new node, your cluster will suffer some serious reliability and performance problems. I don't think we should consider this rare or fringe; in fact it means you can't do rolling restarts properly.
>
> I experienced this in our running production system and eventually had to keep running the cluster without two of my nodes. If you have a node with far fewer regions than the others, then all new regions go to that regionserver... load becomes horribly unbalanced if you have a recent-data bias, with a majority of reads and writes going to a single node. This led to that RS being swamped with long GC pauses and generally bad cluster stability. It's a release blocker alone, IMO. JSharp ran into this yesterday, which is how we realized it had been uncommitted.
>
> I *might* be okay with a release of 0.20.0 without a fix for HBASE-1784 because it is very rare... however, failed compactions leading to data loss is pretty nasty, and we should really try to fix it for the release if we squash RC2 anyway. This is at least worth putting some effort into over the next few days, to see if we can reproduce the issue and fix it (by rolling back failed compactions properly). Better that regions grow to huge sizes because compactions fail, and thus never split, than complete data loss.
>
> HBASE-1780 should be fixed and should not be too difficult, but maybe it's not a release blocker.
>
> HBASE-1794: we'll have to hear from Ryan what its status is.
>
> No one wants to delay the release any longer, but the most important thing we can do is make sure the release is solid... We can't say that with these open issues.
>
> Also, HDFS-200 testing by Ryan is turning up some great stuff, and he has had success (creating a table, kill -9ing the RS and DN, and META recovers fully and the table still exists... magic!). If we wait until Monday or so to cut RC3 (hopefully with fixes for much of the above), then perhaps by the time we're ready for release we can also have "official" but experimental support for HDFS-200.
>
> Ryan mentioned that if it works sufficiently well he'd like to put it into production at supr... and I feel the same here at Streamy. If it generally works, we'll want to put it into production, as the current data-loss story is really the only frightening thing left :)
>
> JG
>
>
> Andrew Purtell wrote:
>
>> There is a lot riding on getting this release right. There have been some serious bugs unearthed since 0.20.0 RC1. This makes me nervous. I'm not sure I understand the rationale for releasing 0.20.0 now and then 0.20.1 in one week, as opposed to taking the same amount of time to run another RC cycle to produce a 0.20.0 without known bad defects. What is the benefit?
>>
>> HBASE-1794: Recovered data still seems missing until compaction, which might not happen for 24 hours. Seems like a fix is already known?
>> HBASE-1780: Data loss, known fix.
>> HBASE-1784: Data loss.
>>
>> I'll try to put up a patch/band-aid against at least one of these tonight.
>>
>> HBASE-1784 is really troubling. We should roll back a failed compaction, not vaporize data. -1 on those grounds alone.
>>
>> - Andy
>>
>>
>> ________________________________
>> From: stack <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, August 26, 2009 4:21:33 PM
>> Subject: Re: ANN: hbase 0.20.0 Release Candidate 2 available for download
>>
>> It will take a week or so to roll a new RC and to test and vote on it.
>>
>> Why not let out RC2 as 0.20.0 and do 0.20.1 within the next week or so?
>>
>> The balancing issue only happens when you bring a new node online. Usually balancing ain't bad.
>>
>> The Mathias issue is bad but still being investigated.
>>
>> Andrew?
>>
>> St.Ack
>>
>>
>> On Wed, Aug 26, 2009 at 1:04 AM, Mathias Herberts <[email protected]> wrote:
>>
>>> On Mon, Aug 24, 2009 at 16:51, Jean-Daniel Cryans <[email protected]> wrote:
>>>
>>>> +1 I ran it without any problem for a while. I asked Mathias if 1784 should kill it and he thinks no, since it is not deterministic.
>>>
>>> Given the latest run I did and the associated logs/investigation, which clearly show that the missing rows are related to failed compactions, I've changed my mind and now think 1784 should kill this RC.
>>>
>>> So -1 for RC2.
>>>
>>> Mathias.
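
The fix Andrew and Mathias are asking for on HBASE-1784 amounts to making compaction fail closed: keep the original store files until a complete compacted file is in place, and throw away the partial output on error. Below is a minimal sketch of that pattern; the names (compact, writeCompactedFile, storeFiles) are made up for illustration and this is not the actual HBase 0.20 Store code.

  // Sketch only: roll back a failed compaction instead of losing data.
  // Hypothetical names throughout; not the HBase 0.20 Store implementation.
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;
  import java.util.List;

  class CompactionRollbackSketch {

    // Rewrite the given store files into one compacted file. If anything
    // fails, delete the half-written temp file and leave the originals
    // untouched, so the worst case is an over-large region, not missing rows.
    static void compact(List<Path> storeFiles, Path tmpDir, Path storeDir) throws IOException {
      Path tmp = Files.createTempFile(tmpDir, "compaction", ".tmp");
      try {
        writeCompactedFile(storeFiles, tmp);              // may throw part-way through
        Path result = storeDir.resolve(tmp.getFileName());
        Files.move(tmp, result);                          // only swap in a complete file
        for (Path old : storeFiles) {
          Files.delete(old);                              // drop originals only after success
        }
      } catch (IOException e) {
        Files.deleteIfExists(tmp);                        // roll back: discard partial output
        throw e;                                          // originals are still intact
      }
    }

    // Stand-in for the real merge of store files; a failure here plays the
    // role of the failed compactions reported in HBASE-1784.
    private static void writeCompactedFile(List<Path> inputs, Path out) throws IOException {
      for (Path in : inputs) {
        Files.write(out, Files.readAllBytes(in), StandardOpenOption.APPEND);
      }
    }
  }

The worst case under this pattern is the one Jonathan describes: a region grows too large because compactions keep failing and it never splits, but no rows disappear.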

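The balancing problem Jonathan describes falls out of any naive assignment policy that always hands new regions to the least-loaded server: a node that rejoins with zero regions receives every new region until it catches up, and with a recent-data bias most traffic follows those new regions. A toy illustration, not HBase's actual balancer code:

  // Toy illustration of the balancing problem: a naive "assign new regions to
  // the least-loaded server" policy funnels everything onto a node that just
  // rejoined with zero regions. Not HBase's balancer code.
  import java.util.HashMap;
  import java.util.Map;

  class NaiveAssignmentSketch {

    // Pick the region server currently holding the fewest regions.
    static String leastLoaded(Map<String, Integer> regionsPerServer) {
      return regionsPerServer.entrySet().stream()
          .min(Map.Entry.comparingByValue())
          .map(Map.Entry::getKey)
          .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
      Map<String, Integer> load = new HashMap<>();
      load.put("rs1", 100);
      load.put("rs2", 100);
      load.put("rs3", 0);   // freshly restarted node, no regions yet

      // Every new region (e.g. splits of recent, hot data) lands on rs3 until
      // it catches up, so traffic with a recent-data bias all hits one node.
      for (int i = 0; i < 10; i++) {
        String target = leastLoaded(load);
        load.merge(target, 1, Integer::sum);
        System.out.println("new region -> " + target);
      }
    }
  }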