On Thu, Dec 20, 2012 at 5:28 PM, Joshua Berkus <j...@agliodbs.com> wrote:
>> > What would such a test look like? It's not obvious to me that
>> > there's any rapid way for a user to detect this situation, without
>> > checking each server individually.
>>
>> Change something on the master and observe that none of the supposed
>> standbys notice?
>
> That doesn't sound like an infallible test, or a 60-second one.
>
> My point is that in a complex situation (imagine a shop with 9 replicated
> servers in 3 different cascaded groups, immediately after a failover of the
> original master), it would be easy for a sysadmin, responding to a
> middle-of-the-night page, to accidentally fat-finger an IP address and
> create a cycle instead of a new master. And once he's done that, it's a
> longish troubleshooting process to figure out what's wrong and why writes
> aren't working, especially if he goes to bed and some other sysadmin picks
> up the "Writes failing to PostgreSQL" ticket.
>
> *If* it's relatively easy for us to detect cycles (that's a big if, I'm
> not sure how we'd do it), then it would help a lot for us to at least emit
> a WARNING. That would short-cut a lot of troubleshooting.
I'm sure it's possible; I don't *think* it's terribly easy. The usual algorithm for cycle detection is to have each node send to the next node the path that the data has taken. But there's no unique identifier for each slave that I know of - you could use the IP address, but that's not really unique. And if the WAL passes through an archive, how do you deal with that? I'm sure somebody could figure all of this stuff out, but it seems fairly complicated for the benefit we'd get. I just don't think this is going to be a terribly common problem; if it turns out I'm wrong, I may revise my opinion. :-)

To me, it seems that lag monitoring between master and standby is something that anyone running a complex replication configuration should be doing - and yeah, I think anything involving four standbys (or cascading) qualifies as complex. If you're doing that, you should notice pretty quickly that your replication lag is increasing steadily. You might also check pg_stat_replication on the master and notice that there are no connections there any more. Could someone miss those tell-tale signs? Sure. But they could also set autovacuum_naptime to an hour and then file a support ticket complaining about table bloat - and they do. Personally, as user screw-ups go, I'd consider that scenario (and its fourteen cousins, twenty-seven second cousins, and three hundred and ninety-two other extended family members) as higher-priority and lower-effort to fix than this particular thing. YMMV, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
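
For illustration only, the path-passing idea is easy enough to sketch in isolation; the hard parts are exactly the ones noted above (no stable node identity, and WAL that may flow through an archive with no identity at all). A toy Python sketch, with hypothetical node names:

    # Toy sketch: path-based cycle detection, assuming every node had a
    # stable identifier -- the piece that is actually missing in practice
    # (IP addresses aren't really unique, and an archive in the middle
    # has no identity of its own).

    def forward_path(node_id, path_so_far):
        # path_so_far is the list of identifiers the WAL has already
        # passed through, as reported by the upstream node.
        if node_id in path_so_far:
            raise RuntimeError("replication cycle: %s -> %s"
                               % (" -> ".join(path_so_far), node_id))
        # Otherwise append ourselves and hand the extended path downstream.
        return path_so_far + [node_id]

    # The failure mode: A feeds B, B feeds C, and a fat-fingered recovery
    # change points A back at C.
    path = forward_path("A", [])
    path = forward_path("B", path)
    path = forward_path("C", path)
    forward_path("A", path)   # raises: replication cycle: A -> B -> C -> A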
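
And a rough sketch of the kind of pg_stat_replication check meant above, using psycopg2; the connection string, threshold, and messages are placeholders, and the column and function names are the 9.2-era spellings (newer releases use sent_lsn/replay_lsn and pg_wal_lsn_diff()):

    # Rough sketch: connect to the node you believe is the master and check
    # that standbys are attached and not falling steadily behind.
    import psycopg2

    LAG_LIMIT_BYTES = 64 * 1024 * 1024   # arbitrary threshold

    conn = psycopg2.connect("host=master-host dbname=postgres user=monitor")
    cur = conn.cursor()
    cur.execute("""
        SELECT application_name,
               pg_xlog_location_diff(sent_location, replay_location)
        FROM pg_stat_replication
    """)
    rows = cur.fetchall()

    if not rows:
        print("WARNING: no standbys connected to this master")
    for name, lag in rows:
        if lag is not None and lag > LAG_LIMIT_BYTES:
            print("WARNING: standby %s is %s bytes behind" % (name, lag))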