I am curious what you mean by standard metrics format? I am all for standards-based, but what are those standards in the case of databases? For environments where O11y matters a lot, I think the challenge lies in mapping specific query executions back to system characteristics. I am just thinking aloud as a newbie to this community.
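(For context on "standard metrics format": in most monitoring stacks today this means the Prometheus/OpenMetrics text exposition format, i.e. plain-text lines of `name{labels} value` served over HTTP at a `/metrics` endpoint. A minimal sketch of what such a line looks like; the metric name and label here are illustrative, not taken from any particular exporter's query set:)

```python
def format_gauge(name, labels, value):
    """Render one metric sample in Prometheus text exposition format."""
    # Labels are rendered as key="value" pairs, sorted for stable output.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# e.g. what an exporter might emit for a per-database commit counter
print(format_gauge("pg_stat_database_xact_commit",
                   {"datname": "postgres"}, 123456))
```

An exporter serves many such lines per scrape; a collector like Prometheus polls the endpoint and stores the samples as time series.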
On Mon, Feb 14, 2022, 12:16 PM Greg Stark <st...@mit.edu> wrote:

> So I've been dealing a lot with building and maintaining dashboards
> for (fleets of) Postgres servers. And it's a pain. I have a few
> strongly held ideas about where the pain points are and what the right
> ways to tackle them are. Some of which are going to be controversial I
> think...
>
> The state of the art is pg_exporter, which is a separate client that
> connects to the database and runs SQL queries to extract database
> metrics. The pain points I've seen are:
>
> 1) The exporter isn't reliable when things are failing. If your
> clients can't connect, the exporter also can't connect, leading to data
> gaps in your metrics for precisely the time windows where you need
> data. This can happen due to connection exhaustion, xid wraparound, or
> even something as simple as someone taking an exclusive lock on
> something used in the SQL queries.
>
> 2) SQL connections are tied to specific databases within a cluster,
> making it hard to get data for all your databases if you have more
> than one. The exporter needs to reconnect to each database.
>
> 3) The exporter needs to listen on a different port from the
> postmaster, making it necessary to write software to manage the
> mapping from server port to exporter port, and that's left to the
> end user as it varies from site to site.
>
> 4) The queries are customizable (the built-in ones don't exhaustively
> export Postgres's metrics). As a result there's no standard
> dashboard that will work on any site out of the box. Moreover, issue
> (3) also makes it impossible to implement one that works properly.
>
> 5) Data needs to be marshaled from shared memory into SQL and then
> read by the client and re-emitted in the metric format. The double
> processing requires writing SQL queries very carefully to avoid losing
> fidelity for things like LSN positions, xids, etc.
> Moreover, the latency of gathering data from multiple SQL queries
> results in metrics that are often out of sync with each other, making
> them hard to interpret.
>
> All this said, I think we should have a component in Postgres that
> reads from the stats data directly and outputs metrics in a standard
> metrics format directly. This would probably take the form of a
> background worker with a few tricky bits.
>
> This would mean there would be a standard official set of metrics
> available that a standard dashboard could rely on to be present at any
> site, and it would be reliable even if the SQL layer isn't functioning
> due to lack of connections or xid wraparound or locking issues.
>
> The elephant in the room is that issue (3) requires a bit of sleight
> of hand. Ideally I would want it to be listening on the same ports as
> the database. That means having the postmaster recognize metric
> requests and hand them to the metrics background worker instead of a
> backend. I'm not sure people are going to be OK with that...
>
> For now my approach is to implement a background worker that listens
> on a new port and is basically its own small web server with shared
> memory access. This ignores issue (3), and my hope is that when we have
> some experience with this approach we'll see how reliable it is and
> how comfortable we are with the kind of hacking in the postmaster it
> would take to fix it. FWIW, I do think this is an important issue and
> not one that we can ignore indefinitely.
>
> There is another elephant in the room (it's a big room), which is that
> this all makes sense for stats data. It doesn't make much sense for
> data that currently lives in pg_class, pg_index, etc. In other words,
> I'm mostly solving (2) by ignoring it and concentrating on stats data.
>
> I haven't settled on a good solution for that data. I vaguely lean
> towards saying that the volatile metrics in those tables should really
> live in stats, or at least be mirrored there.
> That makes a clean definition of what Postgres thinks a metric is and
> what it thinks catalog data is. But I'm not sure that will really work
> in practice. In particular, I think it's likely we'll need to get
> catalog data from every database anyway, for example to label things
> like tables with better labels than OIDs.
>
> This work is being funded by Aiven, which is really interested in
> improving observability and integration between Postgres and other
> open source cloud software.
>
> --
> greg
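(To make point (5) above concrete: a `pg_lsn` value like `'16/B374D848'` is two 32-bit hex halves, and naively casting it to a floating-point metric value, whether in the SQL query or in the exporter, silently loses precision once the byte position exceeds 2^53. The fidelity-preserving move is to keep it an exact 64-bit integer. A standalone sketch of that conversion, not code from any existing exporter or from the proposed worker:)

```python
def lsn_to_byte_pos(lsn: str) -> int:
    """Convert a Postgres LSN string 'hi/lo' into an exact 64-bit byte position."""
    hi, lo = lsn.split("/")
    # High half is the upper 32 bits of the WAL byte position,
    # low half the lower 32 bits; both are hexadecimal.
    return (int(hi, 16) << 32) | int(lo, 16)

# A float-typed metric would round this once positions grow large;
# keeping it an integer preserves fidelity end to end.
assert lsn_to_byte_pos("0/0") == 0
print(lsn_to_byte_pos("16/B374D848"))
```

The same concern applies to xids and other wide counters: every conversion step between shared memory, SQL, and the metrics format is a chance to truncate or round them.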