Robert,

> I'm sure it's possible; I don't *think* it's terribly easy. The usual
> algorithm for cycle detection is to have each node send to the next
> node the path that the data has taken. But, there's no unique
> identifier for each slave that I know of - you could use IP address,
> but that's not really unique. And, if the WAL passes through an
> archive, how do you deal with that?
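For concreteness, here is a rough, untested sketch of the path-propagation
scheme described above. It is not PostgreSQL code: the node identifiers,
class, and forwarding hook are all invented for illustration, and picking
a genuinely unique identifier (and surviving a trip through an archive) is
exactly the unsolved part.

# Sketch only: each node appends its (hypothetical) unique id to the path
# before forwarding WAL downstream; a node that sees its own id in the
# incoming path has found a cycle. An archive in the middle would drop
# the path, which is the hole noted above.

class ReplicationNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.downstream = []              # nodes this one cascades WAL to

    def receive_wal(self, wal, path):
        if self.node_id in path:
            raise RuntimeError("replication cycle detected at %s: %s"
                               % (self.node_id, " -> ".join(path)))
        self.forward_wal(wal, path + [self.node_id])

    def forward_wal(self, wal, path):
        for node in self.downstream:
            node.receive_wal(wal, path)

# The accidental-cycle case: the master dies and s1 gets pointed at s3.
s1, s2, s3 = ReplicationNode("s1"), ReplicationNode("s2"), ReplicationNode("s3")
s1.downstream, s2.downstream, s3.downstream = [s2], [s3], [s1]
s1.receive_wal(b"wal segment", [])        # raises: cycle detected at s1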
Not that I know how to do this, but it seems like a more direct approach
would be to check whether there's a master anywhere up the line. Hmmmm.
Still sounds fairly difficult.

> I'm sure somebody could figure all of this stuff out, but it seems
> fairly complicated for the benefit we'd get. I just don't think this
> is going to be a terribly common problem; if it turns out I'm wrong,
> I may revise my opinion. :-)

I don't think it'll be that common either. The problem is that when it
does happen, it'll be very hard for the hapless sysadmin involved to
troubleshoot.

> To me, it seems that lag monitoring between master and standby is
> something that anyone running a complex replication configuration
> should be doing - and yeah, I think anything involving four standbys
> (or cascading) qualifies as complex. If you're doing that, you should
> notice pretty quickly that your replication lag is increasing
> steadily.

There are many reasons why replication lag might increase steadily.

> You might also check pg_stat_replication on the master and notice
> that there are no connections there any more.

Well, if you've created a true cycle, every server has one or more
replicas. The original case I presented was the most probable cause of
accidental cycles: the original master dies, and the on-call sysadmin
accidentally connects the first replica to the last replica while
trying to recover the cluster.

AFAICT, the only way to troubleshoot a cycle is to test every server in
the network to see whether it's a master that has replicas; if no server
is a master with replicas, it's a cycle (see the sketch below). Again,
not fast or intuitive.

> Could someone miss those tell-tale signs? Sure. But they could also
> set autovacuum_naptime to an hour and then file a support ticket
> complaining about table bloat - and they do. Personally, as user
> screw-ups go, I'd consider that scenario (and its fourteen cousins,
> twenty-seven second cousins, and three hundred and ninety-two other
> extended family members) as higher-priority and lower effort to fix
> than this particular thing.

I agree that this isn't a particularly high-priority issue. I do think
it should go on the TODO list, though, just in case we get a GSoC
student or other new contributor who wants to tackle it.

--Josh
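As an illustration of the "test every server" check above, a rough Python
sketch; the host list, credentials, and driver (psycopg2) are placeholders
and assume a hot-standby connection to every node, while
pg_is_in_recovery() and pg_stat_replication are the standard ways to ask
"am I a standby?" and "who is streaming from me?".

# Sketch, not production code: poll every known node and see whether any
# of them is a writable master that is actually feeding replicas. If none
# is, the topology is suspect (per the reasoning above, likely a cycle).
import psycopg2

HOSTS = ["db1.example.com", "db2.example.com", "db3.example.com"]  # placeholders

def is_master_with_replicas(host):
    conn = psycopg2.connect(host=host, dbname="postgres", user="monitor")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            in_recovery = cur.fetchone()[0]
            cur.execute("SELECT count(*) FROM pg_stat_replication")
            n_replicas = cur.fetchone()[0]
        return (not in_recovery) and n_replicas > 0
    finally:
        conn.close()

if not any(is_master_with_replicas(h) for h in HOSTS):
    print("no node is a master feeding replicas -- suspect a replication cycle")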