Re: [HACKERS] Replication server timeout patch

Daniel Farina Fri, 11 Feb 2011 20:51:37 -0800

On Feb 11, 2011 8:20 PM, "Robert Haas" <robertmh...@gmail.com> wrote:
>
> On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> > <heikki.linnakan...@enterprisedb.com> wrote:
> >> On 11.02.2011 22:11, Robert Haas wrote:
> >>>
> >>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina <drfar...@acm.org> wrote:
> >>>>
> >>>> I split this out of the synchronous replication patch for independent
> >>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
> >>>> anything, but I just wanted to get it out there...I'll be around in
> >>>> Not Too Long to finish any other details.
> >>>
> >>> This looks like a useful and separately committable change.
> >>
> >> Hmm, so this patch implements a watchdog, where the master disconnects the
> >> standby if the heartbeat from the standby stops for more than
> >> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
> >> every wal_receiver_status_interval seconds.
> >>
> >> It would be nice if the master and standby could negotiate those settings.
> >> As the patch stands, it's easy to have a pathological configuration where
> >> replication_server_timeout < wal_receiver_status_interval, so that the
> >> master repeatedly disconnects the standby because it doesn't reply in time.
> >> Maybe the standby should report how often it's going to send a heartbeat,
> >> and master should wait for that long + some safety margin. Or maybe the
> >> master should tell the standby how often it should send the heartbeat?
> >
> > I guess the biggest use case for that behavior would be in a case
> > where you have two standbys, one of which doesn't send a heartbeat and
> > the other of which does.  Then you really can't rely on a single
> > timeout.
> >
> > Maybe we could change the server parameter to indicate what multiple
> > of wal_receiver_status_interval causes a hangup, and then change the
> > client to notify the server what value it's using.  But that gets
> > complicated, because the value could be changed while the standby is
> > running.
>
> On reflection I'm deeply uncertain this is a good idea.  It's pretty
> hopeless to suppose that we can keep the user from choosing parameter
> settings which will cause them problems, and there are certainly far
> stupider things they could do than set replication_timeout <
> wal_receiver_status_interval.  They could, for example, set fsync=off
> or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that
> last one out of the box).  Any of those settings have the potential to
> thoroughly destroy their system in one way or another, and there's not
> a darn thing we can do about it.  Setting up some kind of handshake
> system based on a multiple of the wal_receiver_status_interval is
> going to be complex, and it's not necessarily going to deliver the
> behavior someone wants anyway.  If someone has
> wal_receiver_status_interval=10 on one system and =30 on another
> system, does it therefore follow that the timeouts should also be
> different by 3X?  Perhaps, but it's non-obvious.
>
> There are two things that I think are pretty clear.  If the receiver
> has wal_receiver_status_interval=0, then we should ignore
> replication_timeout for that connection.  And also we need to make
> sure that the replication_timeout can't kill off a connection that is
> in the middle of streaming a large base backup.  Maybe we should try
> to get those two cases right and not worry about the rest.  Dan, can
> you check whether the base backup thing is a problem with this as
> implemented?
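
[Editorial sketch, not part of the patch under discussion.] The two rules
Robert calls "pretty clear" above, plus the pathological configuration Heikki
describes, might be expressed roughly as follows. This is only an illustration
in Python under stated assumptions: the function names, the
streaming_base_backup flag, and the idea that the master knows the standby's
wal_receiver_status_interval are all hypothetical. The actual patch is C code
inside the walsender and may work quite differently.

```python
def is_pathological(replication_timeout: int, status_interval: int) -> bool:
    """True if the standby's heartbeat can never arrive in time.

    Mirrors the case described in the thread:
    replication_timeout < wal_receiver_status_interval.
    An interval of 0 means "no heartbeats", which is handled separately.
    """
    return status_interval > 0 and replication_timeout < status_interval


def should_disconnect(seconds_since_last_reply: float,
                      replication_timeout: int,
                      status_interval: int,
                      streaming_base_backup: bool = False) -> bool:
    """Decide whether the master should drop this standby connection.

    All parameters are hypothetical stand-ins for state the master would
    need to track per connection.
    """
    # Rule 1: a standby with wal_receiver_status_interval=0 never sends a
    # heartbeat, so the timeout is ignored for that connection.
    if status_interval == 0:
        return False
    # Rule 2: never kill a connection that is in the middle of streaming
    # a base backup.
    if streaming_base_backup:
        return False
    # Otherwise, apply the plain watchdog.
    return seconds_since_last_reply > replication_timeout
```

For example, with replication_timeout=60 and a standby reporting every 10
seconds, should_disconnect(61, 60, 10) is true, while the same silence from a
standby with an interval of 0, or from a connection streaming a base backup,
is tolerated.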


Yes, I will have something to say come Saturday.

--
fdr
