* Josh Berkus (j...@agliodbs.com) wrote:
> On 01/11/2014 08:52 PM, Amit Kapila wrote:
> > It is better than async mode in a way such that in async mode it never
> > waits for commits to be written to standby, but in this new mode it will
> > do so unless it is not possible (all sync standbys go down).
> > Can't we use existing wal_sender_timeout, or even if user expects a
> > different timeout because for this new mode, he expects master to wait
> > more before it starts operating like standalone sync master, we can provide
> > a new parameter.
> One of the reasons that there's so much disagreement about this feature
> is that most of the folks strongly in favor of auto-degrade are thinking
> *only* of the case that the standby is completely down.  There are many
> other reasons for a sync transaction to hang, and the walsender has
> absolutely no way of knowing which is the case.  For example:

Uhh, yea, no, I'm pretty sure those in favor of auto-degrade are very
specifically thinking of cases like "Standby is restarting", which is
not a reason for the master to fall over.

> * Transient network issues
> * Standby can't keep up with master
> * Postgres bug
> * Storage/IO issues (think EBS)
> * Standby is restarting
> You don't want to handle all of those issues the same way as far as sync
> rep is concerned.  For example, if the standby is restarting, you
> probably want to wait instead of degrading.

*What*?!  Certainly not in any kind of OLTP-type system; a system
restart can easily take minutes.  Clearly, you want to resume once the
standby is back up, which I feel like the people against an auto-degrade
mode are missing, but holding up a commit until the standby finishes
rebooting isn't practical.

> There's also the issue that this patch, and necessarily any
> walsender-level auto-degrade, has IMHO no safe way to resume sync
> replication.  This means that any user who has a network or storage blip
> once a day (again, think AWS) would be constantly in degraded mode, even
> though both the master and the replica are up and running -- and it will
> come as a complete surprise to them when they lose the master and
> discover that they've lost data.

I don't follow this logic at all. Why is there no safe way to resume?
You wait until the slave is caught up fully and then go back to sync mode.
If that turns out to be an extended problem then an alarm needs to be
raised, of course.
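
To make the resume logic concrete, here is a rough sketch of the state
machine described above. This is only an illustration under stated
assumptions, not code from the patch or from PostgreSQL: the class
SyncRepState, the tick() method, and the two thresholds degrade_after_s
and alarm_after_s are all hypothetical names made up for this example.

```python
from enum import Enum


class Mode(Enum):
    SYNC = "sync"
    DEGRADED = "degraded"


class SyncRepState:
    """Hypothetical model of the master's replication mode (not PostgreSQL code)."""

    def __init__(self, degrade_after_s, alarm_after_s):
        # Assumed knobs: how long a commit may wait before the master degrades,
        # and how long degraded mode may last before an alarm is raised.
        self.degrade_after_s = degrade_after_s
        self.alarm_after_s = alarm_after_s
        self.mode = Mode.SYNC
        self.degraded_since = None
        self.alarm = False

    def tick(self, now, standby_connected, replay_lag_bytes, oldest_wait_s):
        """Advance the state once; all inputs are supplied by the caller."""
        if self.mode is Mode.SYNC:
            # A commit has waited too long on the standby: stop waiting.
            if oldest_wait_s >= self.degrade_after_s:
                self.mode = Mode.DEGRADED
                self.degraded_since = now
        else:
            if standby_connected and replay_lag_bytes == 0:
                # Standby is back and fully caught up: resume sync mode.
                self.mode = Mode.SYNC
                self.degraded_since = None
                self.alarm = False
            elif now - self.degraded_since >= self.alarm_after_s:
                # Degraded for an extended period: someone needs to know.
                self.alarm = True
        return self.mode
```

The point of the sketch is that the only way back to sync mode is through
"connected and zero lag", so the master never claims sync guarantees while
the standby is still behind, and a long stretch in degraded mode is what
raises the alarm.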


