Re: [HACKERS] After switching primary server while using replication slot.

Robert Haas Wed, 20 Aug 2014 10:22:15 -0700

On Tue, Aug 19, 2014 at 6:25 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
> On Mon, Aug 18, 2014 at 11:16 PM, Sawada Masahiko <sawada.m...@gmail.com> wrote:
>> Hi all,
>> After switching the primary server while using replication slots, a
>> standby server may not be able to connect to the new primary server.
>> Imagine this situation: the primary server has two ASYNC standby
>> servers, each using its own replication slot.
>> One standby (A) applies WAL without problems, but the other
>> standby (B) has stopped after connecting to the primary server
>> (or its WAL sending is badly delayed).
>>
>> In this situation, standby (B) does not receive WAL segment files
>> while it is stopped.
>> The primary server cannot remove WAL segments that have not been
>> received by all standbys, so it has to keep those segment files.
>> But standby (A) can perform checkpoints itself, and can then
>> recycle WAL segments.
>> So the number of WAL segments on each server differs.
>> (Standby (A) has fewer WAL files than the primary server.)
>> After the primary server crashes and standby (A) is promoted to primary,
>> we can try to connect standby (B) to standby (A) as a new standby server.
>> But that will fail, because standby (A) might not have the WAL
>> segment files that standby (B) requires.
>
> This sounds like a valid concern.
>
>> To resolve this situation, I think that we should make the master server
>> notify all standby servers about the removal of WAL segments,
>> and have the standby servers recycle WAL segment files based on that
>> information.
>>
>> Thoughts?
>
> How does the server recycle WAL files after it's promoted from
> standby to master?
> Does it do that as it likes? If yes, your approach would not be enough.
>
> The approach prevents unexpected removal of WAL files while the standby
> is running. But after the standby is promoted to master, it might recycle
> needed WAL files immediately. So another standby may still fail to retrieve
> the required WAL file after the promotion.
>
> ISTM that, in order to address this, we might need to log all the replication
> slot activities and replicate them to the standby. I'm not sure if this
> breaks the design of replication slots, though.


Yuck.

I believe that the reason replication slots are not currently
replicated is that we had the idea that the standby could have
slots that don't exist on the master, for cascading replication.  I'm
not sure that works yet, but I think Andres definitely had it in mind
in the original design.
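
To make the failure mode from upthread concrete, here is a toy model in
Python. It is only a sketch: the Server class, its fields, and the segment
numbers are invented for illustration and do not correspond to anything in
the PostgreSQL source. It assumes a server recycles every segment that is
needed neither locally nor by one of its own slots, which is a
simplification of what a real checkpoint does.

```python
class Server:
    """Hypothetical stand-in for one node's WAL retention state."""

    def __init__(self, name, slots=None):
        self.name = name
        self.oldest_segment = 1          # oldest WAL segment still on disk
        self.latest_segment = 1          # newest WAL segment on disk
        # slot name -> oldest segment the slot's consumer still needs
        self.slots = dict(slots or {})

    def advance(self, latest):
        """Pretend WAL has been generated or replayed up to `latest`."""
        self.latest_segment = latest

    def checkpoint(self):
        """Recycle segments that neither this node nor its own slots need."""
        keep_from = min([self.latest_segment] + list(self.slots.values()))
        self.oldest_segment = max(self.oldest_segment, keep_from)

    def can_serve(self, segment):
        return self.oldest_segment <= segment <= self.latest_segment


# The primary has a slot per standby; B stalled at segment 10.
primary = Server("primary", slots={"standby_a": 100, "standby_b": 10})
# Slots are not replicated, so standby A knows nothing about B's position.
standby_a = Server("standby_a")

for node in (primary, standby_a):
    node.advance(100)
    node.checkpoint()

needed_by_b = 10
print("primary can serve B:  ", primary.can_serve(needed_by_b))
print("standby A can serve B:", standby_a.can_serve(needed_by_b))
```

In this model the primary holds segment 10 because of B's slot, while
standby A has already recycled it, so B cannot resume from A after a
promotion.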

It seems to me that if every machine needs to keep not only the WAL it
requires for itself, but also the WAL that any other machine in the
replication hierarchy might need, that pretty much sucks.  Suppose
you have a master with 10 standbys, and each standby has 10 cascaded
standbys.  If one of those standbys goes down, do we really want all
100 other machines to keep copies of all the WAL?  That seems rather
unfortunate, since it's likely that only a few of those many standbys
are machines to which we would consider failing over.
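
A back-of-the-envelope sketch of that arithmetic, in Python. The function
names, the 1000-segment lag, and the choice of 3 failover candidates are
assumptions made up for illustration; only the 16 MiB figure is
PostgreSQL's default WAL segment size. Counted exactly, the hierarchy
above is 111 machines, so the round "100" is if anything an underestimate.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # PostgreSQL's default WAL segment size


def hierarchy_size(direct_standbys, cascaded_per_standby):
    """One master, its direct standbys, and their cascaded standbys."""
    return 1 + direct_standbys + direct_standbys * cascaded_per_standby


def retained_bytes(machines_retaining, segments_behind):
    """WAL held back cluster-wide on behalf of one lagging standby."""
    return machines_retaining * segments_behind * SEGMENT_BYTES


total = hierarchy_size(10, 10)   # master + 10 standbys + 100 cascaded
others = total - 1               # every machine except the one that is down
lag = 1000                       # hypothetical: segments the down standby is behind

everyone = retained_bytes(others, lag)
candidates_only = retained_bytes(3, lag)  # hypothetical: 3 failover candidates

gib = 1024 ** 3
print(f"machines in hierarchy:       {total}")
print(f"all others retain:           {everyone / gib:.1f} GiB")
print(f"failover candidates retain:  {candidates_only / gib:.1f} GiB")
```

The point is only the ratio: retaining on every machine costs roughly
others / candidates times as much disk as retaining on the few machines
that could actually be promoted.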

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company