Re: [HACKERS] Immediate standby promotion

Simon Riggs Thu, 25 Sep 2014 10:18:57 -0700

On 25 September 2014 16:29, Andres Freund <and...@2ndquadrant.com> wrote:


>> > To me, being able to say "pg_ctl promote_right_now -m yes_i_mean_it"
>> > seems like a friendlier interface than making somebody shut down the
>> > server, run pg_resetxlog, and start it up again.
>>
>> It makes sense to go from paused --> promoted.
>>
>> It doesn't make sense to go from normal running --> promoted, since
>> that is just random data loss.
>
> Why? I don't see what's random in promoting a node in the current state
> *iff* it's currently consistent.
>
> Just imagine something like promoting a current standby to a full node
> because you want to run some tests on it that require writes. There's
> absolutely no need to investigate the current state for that.
>
>> I very much understand the case where
>> somebody is shouting "get the web site up, we are losing business".
>> Implementing a feature that allows people to do exactly what they
>> asked (go live now), but loses business transactions that we thought
>> had been safely recorded is not good. It implements only the exact
>> request, not its actual intention.
>
> That seems to be a problem of massive misunderstanding on the part of the
> user. And I don't see how this is going to be safer by requiring the
> user to first issue a pause request.
>
> I think we should attempt to solve this by naming the command
> appropriately. Something like 'abort_replay_and_promote'. Long,
> nontrivial to type, and descriptive.
>
>> Any feature that lumps both cases together is wrongly designed and
>> will cause data loss.
>>
>> We go to a lot of trouble to ensure data is successfully on disk and
>> in WAL. I won't give that up, nor do I want to make it easier to lose
>> data than it already is.
>
> I think that's not really related. Such a promotion doesn't cause data
> loss in the sense of losing data a *clueful* operator wanted to
> keep. Yes, it can be used wrongly, but it's far from alone in that.

Yes, it does cause data loss. The clueful operator has no idea where
they are, so there is no "used rightly" in that case.

If I were to give this feature a name it would be --discard,
--random-data-loss, or --reset-hard.

The point of pausing is misunderstood. That is close but not quite relevant.

If you are at a known location and request promotion, we can presume
you know what you are doing, so it is simply Promote.

If you are at an unknown location and therefore have clearly not
verified any state before promotion, you are clearly making an
uninformed decision that will likely result in data loss, for which
there is no way of knowing the impact and no mechanism for recovering
from. Trying to promote something while it is still recovering proves
we don't know the state; we're just picking a random LSN.

So if you have a time-delayed standby and the master breaks, or there
is a bad transaction, then the correct action would be to
* pause the delayed standby
* discover where the master broke, or the xid of the bad transaction
* restart recovery to go up to the correct time/xid/lsn
* promote standby

That is already possible in 9.4.
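
As a rough illustration of the four steps above, here is a minimal Python
sketch that only prints the sequence; it does not connect to any server.
The helper name delayed_standby_plan and the $PGDATA placeholder are
hypothetical. The SQL functions and recovery.conf settings are the 9.4-era
names as I understand them (pg_xlog_replay_pause,
pg_last_xlog_replay_location, recovery_target_time, recovery_target_xid).
Only time and xid targets are covered, since that is what I am sure
recovery.conf accepts in 9.4. How the final promotion interacts with a
pause at the recovery target is not modelled here.

```python
def delayed_standby_plan(target_kind, target_value):
    """Return the ordered steps as (description, command or setting) pairs.

    Illustrative only: nothing here is executed against a server.
    """
    if target_kind not in ("time", "xid"):
        raise ValueError("target_kind must be 'time' or 'xid'")
    # recovery.conf line that tells recovery where to stop (assumed 9.4 syntax)
    setting = "recovery_target_%s = '%s'" % (target_kind, target_value)
    return [
        ("pause replay on the delayed standby",
         "SELECT pg_xlog_replay_pause();"),
        ("check how far replay has got",
         "SELECT pg_last_xlog_replay_location();"),
        ("set the stop point in recovery.conf", setting),
        ("restart the standby so it replays up to the target",
         "pg_ctl restart -D $PGDATA"),
        ("promote the standby", "pg_ctl promote -D $PGDATA"),
    ]


if __name__ == "__main__":
    # '1234' is a made-up xid standing in for the bad transaction.
    for description, command in delayed_standby_plan("xid", "1234"):
        print("%-55s %s" % (description + ":", command))
```

The point of the ordering is that the stop location is chosen before
promotion, rather than being whatever LSN replay happened to reach.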

The original patch for pausing contained code to reset the PITR target
with functions, which would make the above even easier.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

