On Feb 11, 2011 8:20 PM, "Robert Haas" <robertmh...@gmail.com> wrote:
>
> On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> > <heikki.linnakan...@enterprisedb.com> wrote:
> >> On 11.02.2011 22:11, Robert Haas wrote:
> >>>
> >>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina<drfar...@acm.org> wrote:
> >>>>
> >>>> I split this out of the synchronous replication patch for independent
> >>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
> >>>> anything, but I just wanted to get it out there...I'll be around in
> >>>> Not Too Long to finish any other details.
> >>>
> >>> This looks like a useful and separately committable change.
> >>
> >> Hmm, so this patch implements a watchdog, where the master disconnects the
> >> standby if the heartbeat from the standby stops for more than
> >> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
> >> every wal_receiver_status_interval seconds.
> >>
> >> It would be nice if the master and standby could negotiate those settings.
> >> As the patch stands, it's easy to have a pathological configuration where
> >> replication_server_timeout < wal_receiver_status_interval, so that the
> >> master repeatedly disconnects the standby because it doesn't reply in time.
> >> Maybe the standby should report how often it's going to send a heartbeat,
> >> and master should wait for that long + some safety margin. Or maybe the
> >> master should tell the standby how often it should send the heartbeat?
> >
> > I guess the biggest use case for that behavior would be in a case
> > where you have two standbys, one of which doesn't send a heartbeat and
> > the other of which does. Then you really can't rely on a single
> > timeout.
> >
> > Maybe we could change the server parameter to indicate what multiple
> > of wal_receiver_status_interval causes a hangup, and then change the
> > client to notify the server what value it's using. But that gets
> > complicated, because the value could be changed while the standby is
> > running.
>
> On reflection I'm deeply uncertain this is a good idea. It's pretty
> hopeless to suppose that we can keep the user from choosing parameter
> settings which will cause them problems, and there are certainly far
> stupider things they could do than set replication_timeout <
> wal_receiver_status_interval. They could, for example, set fsync=off
> or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that
> last one out of the box). Any of those settings have the potential to
> thoroughly destroy their system in one way or another, and there's not
> a darn thing we can do about it. Setting up some kind of handshake
> system based on a multiple of the wal_receiver_status_interval is
> going to be complex, and it's not necessarily going to deliver the
> behavior someone wants anyway. If someone has
> wal_receiver_status_interval=10 on one system and =30 on another
> system, does it therefore follow that the timeouts should also be
> different by 3X? Perhaps, but it's non-obvious.
>
> There are two things that I think are pretty clear. If the receiver
> has wal_receiver_status_interval=0, then we should ignore
> replication_timeout for that connection. And also we need to make
> sure that the replication_timeout can't kill off a connection that is
> in the middle of streaming a large base backup. Maybe we should try
> to get those two cases right and not worry about the rest. Dan, can
> you check whether the base backup thing is a problem with this as
> implemented?
Yes, I will have something to say come Saturday.

--
fdr
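
For reference, the two cases Robert calls "pretty clear" in the quoted message, plus the pathological setting Heikki describes, could be sketched roughly as below. This is only an illustrative Python model, not the patch's C code. The names StandbyConnection, config_warning and should_disconnect are hypothetical, and the sketch assumes the master somehow knows each standby's wal_receiver_status_interval, which the patch as posted does not provide. Treating a timeout of 0 as "disabled" is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandbyConnection:
    """Hypothetical view of one replication connection as seen by the master."""
    # Seconds between standby status messages; 0 means the standby sends none.
    status_interval: float
    # Seconds since the master last heard from this standby.
    seconds_since_last_reply: float
    # True while this connection is streaming a base backup.
    streaming_base_backup: bool = False


def config_warning(replication_timeout: float, status_interval: float) -> Optional[str]:
    """Return a warning for the pathological case from the thread, else None."""
    if status_interval > 0 and 0 < replication_timeout < status_interval:
        return (
            "replication_timeout (%ss) is shorter than "
            "wal_receiver_status_interval (%ss); the master would keep "
            "disconnecting this standby" % (replication_timeout, status_interval)
        )
    return None


def should_disconnect(conn: StandbyConnection, replication_timeout: float) -> bool:
    """Decide whether the watchdog should drop this connection."""
    # Assumption: a timeout of 0 (or less) disables the watchdog entirely.
    if replication_timeout <= 0:
        return False
    # Case 1: the standby sends no heartbeats, so the timeout is ignored.
    if conn.status_interval == 0:
        return False
    # Case 2: never kill a connection that is streaming a base backup.
    if conn.streaming_base_backup:
        return False
    return conn.seconds_since_last_reply > replication_timeout


if __name__ == "__main__":
    quiet_standby = StandbyConnection(status_interval=0, seconds_since_last_reply=1000)
    print(should_disconnect(quiet_standby, replication_timeout=60))
    print(config_warning(replication_timeout=5, status_interval=10))
```

Whether the master can actually learn status_interval per connection is exactly the open question in the thread, so the first check is the part most likely to look different in a real implementation.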