On Fri, 2010-10-08 at 17:06 +0200, Markus Wanner wrote:

> Well, full cluster outages are infrequent, but sadly cannot be avoided
> entirely. (Murphy's laughing). IMO we should be prepared to deal with
> those.

I've described how I propose to deal with those. I'm not waving away these issues, just proposing that we consciously choose simplicity and therefore robustness.

Let me say it again for clarity. (This is written for the general case, though my patch uses only k=1, i.e. one acknowledgement):

If we want robustness, we have multiple standbys. So if you lose one, you continue as normal without interruption. That is the first and most important line of defence - not software.

When we start to wait, if there aren't sufficient active standbys to acknowledge a commit, then the commit won't wait. This behaviour helps us avoid situations where we are hours or days away from having a working standby to acknowledge the commit. We've had a long debate about servers that "ought to be there" but aren't; I suggest we treat standbys that aren't there as having a strong possibility they won't come back, and hence not worth waiting for. Heikki disagrees; I have no problem with adding server registration so that we can add additional waits, but I doubt that the majority of users prefer waiting over availability. It can be an option.

Once we are waiting, if insufficient standbys acknowledge the commit we will wait until the timeout expires, after which we commit and continue working. If you don't like timeouts, set the timeout to 0 to wait forever. This behaviour is designed to emphasise availability. (I acknowledge that some people are so worried by data loss that they would choose to stop changes altogether, and accept unavailability; I regard that as a minority use case, but one which I would not argue against including as an option at some point in the future.)

To cover Dimitri's observation that when a streaming standby first connects it might take some time before it can sensibly acknowledge, we don't activate the standby until it has caught up. Once caught up, it will advertise its capability to offer a sync rep service.

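As a rough illustration of the rules above, here is a small Python sketch. It is not code from the patch. The names Standby, Outcome, commit_outcome, caught_up and offers_sync_rep are all hypothetical, and it assumes that only standbys which have caught up and offer the sync rep service count as "active". The timeout and the acknowledgement times are in the same arbitrary unit, measured from the moment the commit starts to wait.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NO_WAIT = "commit without waiting"
    ACKNOWLEDGED = "commit after k acknowledgements"
    TIMED_OUT = "commit after timeout"
    WAITS_FOREVER = "still waiting (timeout = 0)"


@dataclass
class Standby:
    name: str
    caught_up: bool               # has finished catching up with the primary
    offers_sync_rep: bool = True  # hypothetical stand-in for the service setting

    def is_active(self) -> bool:
        # Assumption: a standby counts only once it has caught up
        # and advertises the sync rep service.
        return self.caught_up and self.offers_sync_rep


def commit_outcome(standbys, k, timeout, ack_times):
    """Decide how a commit ends, given k required acknowledgements (k >= 1).

    ack_times maps a standby name to the time at which it would acknowledge;
    standbys missing from the map never acknowledge.
    timeout == 0 means wait forever.
    """
    active = [s for s in standbys if s.is_active()]

    # Rule 1: too few active standbys when we start to wait -> don't wait.
    if len(active) < k:
        return Outcome.NO_WAIT

    # Rule 2: wait until the k-th acknowledgement arrives or the timeout expires.
    times = sorted(ack_times[s.name] for s in active if s.name in ack_times)
    if len(times) >= k and (timeout == 0 or times[k - 1] <= timeout):
        return Outcome.ACKNOWLEDGED

    # Rule 3: timeout 0 never expires; otherwise commit and carry on.
    if timeout == 0:
        return Outcome.WAITS_FOREVER
    return Outcome.TIMED_OUT
```

With k=1, one caught-up standby and one still catching up, a commit that is acknowledged before the timeout returns Outcome.ACKNOWLEDGED. The same setup with k=2 returns Outcome.NO_WAIT, because only one standby is active when the wait would begin.
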
Standbys that don't wish to be failover targets can set synchronous_replication_service = off.

The paths between servers aren't defined explicitly, so the parameters all still work even after failover.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services