On 17 October 2017 at 01:02, Joshua D. Drake <j...@commandprompt.com> wrote: > On 10/15/2017 07:39 PM, Craig Ringer wrote: >> >> On 13 October 2017 at 08:50, Joshua D. Drake <j...@commandprompt.com> wrote: >>> >>> -Hackers, >>> >>> I had a long call with a firm developing front end proxy/cache/HA for >>> Postgres today. Essentially the software is a replacement for PGPool in >>> entirety but also supports analytics etc... When I was asking them about >>> pain points they talked about the below and I was wondering if this is a >>> problem we would like to solve. >> >> >> IMO: no one node knows the full state of the system, or can know it. > > > That isn't exactly true. We do know if our replication state is current but > only from the master which is part of the problem.
Sure. But unless you have a perfectly-reliable, latency-ignoring wormhole link between master and standby, the standby always has imperfect knowledge of the master. More importantly, it can never know for sure how old its knowledge is. https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-WAL-RECEIVER-VIEW already does about the best we can probably do. In particular last_msg_send_time and last_msg_receipt_time, used in combination with latest_end_lsn and latest_end_time. >> That said, I do think it'd be very desirable for us to introduce a >> greater link from a standby to master: >> >> - Get info about master. We should finish merging recovery.conf into >> postgresql.conf. > > > Definitely. There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL 11 for that. >>> 1. The dblink call doesn't have a way to specify a timeout, so we have >>> to >>> use Java futures to control how long this may take to a reasonable amount >>> of >>> time; >> >> >> statement_timeout doesn't work? > > > That would be a work around definitely but I think it would be better to > say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (Example, let's not get off into > the weeds :P) and if the standby can't receive a ping/ack within 120 it will > promote itself. I'm confused by this. I thought you were talking about timeouts querying status of an upstream over dblink. Not automatic self-promotion. I'm really not a fan of Pg standbys self-promoting without working with an external co-ordinator that handles STONITH/fencing. It's a recipe for disaster. That's what I was saying upthread, that implementing bits and pieces here can be quite dangerous. This also takes it well outside what you were talking about, improving the ability to detect Pg's state, and into having it become its own co-ordinator for HA actions. So lets go back to the original question. What's missing that statement_timeout doesn't provide for querying remote servers for their status over dblink? If you want a nicer way to say "look up whatever your conninfo in recovery.conf is, connect to it, get me some info on it and return it, possibly daisy-chaining up a chain of replicas if you reach the master" ... that's fine. But it's a different thing. >> Er, yes? I don't understand what you are getting at here. > > > Yes, I will need to go back to them on this one. I think what they mean is > that if we have a connection that is getting closed it doesn't return why it > is closing. It just throws an error. Yes, we do. From https://www.postgresql.org/docs/current/static/errcodes-appendix.html: Class 57 — Operator Intervention 57000 operator_intervention 57014 query_canceled 57P01 admin_shutdown 57P02 crash_shutdown 57P03 cannot_connect_now 57P04 database_dropped Maybe they want more granularity in terms of what reasons are given and what errors are reported. That's fine. But please provide specifics. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers