On Wed, Jun 8, 2011 at 7:30 PM, John Arbash Meinel <j...@arbash-meinel.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> ...
>> I would like to fix the postgresql one too; at the moment the way we
>> work with it - due to its design around clustering and schema changes
>> - is to change things once a month, which drives latency for feature
>> work and performance work - we're *just now* landing a change we could
>> have had out there for 3 weeks, if we didn't have a 4 week cycle.
>>
>> Postgresql having defects in this area isn't a reason to bring in
>> other like defects in new components :)
>>
>> -Rob
>
> Absolutely. But my point is that postgres fails on both accounts. If the
> master dies you're screwed, but you also can't stop one machine to
> upgrade while the other keeps churning.
Right. And if we were evaluating DBs today, we would be having a discussion around precisely this point. There's no guarantee that we wouldn't instead go for e.g. Drizzle with NDB.

> It sounds like Rabbit suffers from the same problem. Though it also
> sounds like a 3s downtime wouldn't be nearly the problem a 5-min
> downtime would be. (and much less than a 90min downtime window.)

3s is 3 times the target window for 99% of requests; it's over half the total time new pages will be allotted. I would be less concerned with a 500ms failover (the entire end-to-end event), but would prefer 100ms or so. That's still 10% of our target request time.

> I didn't know Rabbit particularly well. And I agree you don't want to
> add more bad.
>
> However, if you have 2 Rabbits in active-passive. You stop the second
> one to upgrade it, then you do a 5s downtime to switch, and upgrade the
> first. (The old passive has become the new active). Is the issue that
> you have a complete gap? Is it possible to haproxy this (some sort of
> proxy that would queue up requests for the 5s necessary to switch over,
> without killing them).

That sounds like a great deal of complexity vs just accepting that Rabbit can fail and lose its current queue. As for queuing requests up, yes, I think we could do that, but hell - we have HA HTTP services trivially, and if we backend a queue onto e.g. Cassandra we'd have (modulo split-brain concerns) a truly HA queue. Or Telehash, or even onto a DHT directly.

> Is it that you aren't able to ever create a clean break? (You'll always
> have random new requests that you can't shunt over to the new system,
> because you can't shut down the old system because it is still serving
> the last requests.)

Yes, you'll always have in-flight requests, and so you need to decide how to handle them. Of particular concern are dispatched work items which are not idempotent: the queue going away and coming back will interact badly with a worker needing to report that it handled something - particularly if the worker fails too after doing the work... The persistence side of the design has -long- tendrils. I'm advocating that we do what we can without persistence - which should be a great deal of very interesting things.

-Rob
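
P.S. To make the non-idempotent work item concern concrete, here is a rough, untested Python sketch of the sort of dedup guard a consumer ends up needing. All the names here (item_key, completed_items, handle) are illustrative only, not anything we actually have:

    import hashlib

    def item_key(payload):
        """Derive a stable key for a work item so redelivery can be spotted."""
        return hashlib.sha256(payload).hexdigest()

    def process(payload, completed_items, handle):
        """Run `handle` at most once per payload.

        `completed_items` stands in for durable storage (e.g. a DB table)
        recording which items have already been handled; `handle` is the
        non-idempotent work itself.
        """
        key = item_key(payload)
        if key in completed_items:
            # The queue came back and redelivered something we already did:
            # acknowledge it without re-running the side effect.
            return
        handle(payload)
        # If the worker dies between handle() and this record, the item will
        # be redone on redelivery - which is exactly the nasty case above.
        completed_items.add(key)

The awkward bit is that last window: without a persistent, transactional place to record completion, redelivery after a queue failure can still double-run work, which is why the persistence side has such long tendrils.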