I am curious what you mean by a standard metrics format. I am all for
being standards-based, but what are the standards in the case of
databases? For environments where observability (O11y) matters a lot,
I think the challenge lies in mapping specific query executions back
to system characteristics. I am just thinking aloud as a newbie to
this community.

On Mon, Feb 14, 2022, 12:16 PM Greg Stark <st...@mit.edu> wrote:

> So I've been dealing a lot with building and maintaining dashboards
> for (fleets of) Postgres servers, and it's a pain. I have a few
> strongly held ideas about where the pain points are and what the
> right ways to tackle them are, some of which are going to be
> controversial, I think...
>
> The state of the art is pg_exporter, which is a separate client that
> connects to the database and runs SQL queries to extract database
> metrics (there's a minimal sketch of that scrape pattern right after
> this list). The pain points I've seen are:
>
> 1) The exporter isn't reliable when things are failing. If your
> clients can't connect, the exporter can't connect either, leading to
> data gaps in your metrics for precisely the time windows where you
> need data. This can happen due to connection exhaustion, xid
> wraparound, or even something as simple as someone taking an
> exclusive lock on something used in the SQL queries.
>
> 2) SQL connections are tied to specific databases within a cluster,
> making it hard to get data for all your databases if you have more
> than one; the exporter needs to reconnect to each database in turn.
>
> 3) The exporter needs to listen on a different port from the
> postmaster, making it necessary to write software to manage the
> mapping from server port to exporter port; that's left to the end
> user, as it varies from site to site.
>
> 4) The queries are customizable (the built-in ones don't exhaustively
> export Postgres's metrics). As a result there's no standard dashboard
> that will work on any site out of the box. Moreover, issue (3) makes
> it impossible to implement one that works properly.
>
> 5) Data needs to be marshaled from shared memory into SQL, then read
> by the client and re-emitted in the metric format. The double
> processing requires writing the SQL queries very carefully to avoid
> losing fidelity for things like LSN positions, xids, etc. (the second
> sketch below shows the kind of server-side conversion I mean).
> Moreover, the latency of gathering data from multiple SQL queries
> results in metrics that are often out of sync with each other, making
> them hard to interpret.
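>
> To make the scrape pattern concrete, here's a minimal sketch in C of
> what such an exporter does on every scrape. This is illustrative
> only: the connection string, the query, and the metric name are just
> examples, not pg_exporter's actual code.
>
>   /* Illustrative exporter scrape; not any real exporter's code. */
>   #include <stdio.h>
>   #include <libpq-fe.h>
>
>   int main(void)
>   {
>       /* One connection per scrape: if the server has no free
>        * connection slots this fails and the scrape emits nothing,
>        * which is exactly pain point (1). */
>       PGconn *conn = PQconnectdb("host=localhost dbname=postgres");
>
>       if (PQstatus(conn) != CONNECTION_OK)
>       {
>           fprintf(stderr, "scrape failed: %s", PQerrorMessage(conn));
>           PQfinish(conn);
>           return 1;
>       }
>
>       PGresult *res = PQexec(conn,
>           "SELECT datname, xact_commit FROM pg_stat_database");
>
>       if (PQresultStatus(res) == PGRES_TUPLES_OK)
>       {
>           for (int i = 0; i < PQntuples(res); i++)
>               printf("pg_stat_database_xact_commit{datname=\"%s\"} %s\n",
>                      PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));
>       }
>       PQclear(res);
>       PQfinish(conn);
>       return 0;
>   }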
>
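> And here's the kind of care point (5) forces on you: to export a WAL
> position you want the server to hand you a plain number rather than
> an LSN string the client has to parse. Reusing the includes from the
> sketch above (the metric name is made up):
>
>   /* Sketch: fetch the current WAL position as a byte count.
>    * pg_wal_lsn_diff() returns a numeric, so the text result is
>    * already a plain decimal integer and nothing is lost parsing
>    * an LSN client-side. */
>   static void emit_wal_lsn_metric(PGconn *conn)
>   {
>       PGresult *res = PQexec(conn,
>           "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')");
>
>       if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
>           printf("pg_wal_lsn_bytes %s\n", PQgetvalue(res, 0, 0));
>       PQclear(res);
>   }
>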
> All this said, I think we should have a component in Postgres that
> reads the stats data directly and outputs metrics in a standard
> metrics format. This would probably take the form of a background
> worker, with a few tricky bits.
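>
> A sketch of the shape that could take, using the existing background
> worker API (the module and function names here are made up, and the
> real main function would do far more than this registration shows):
>
>   /* metrics_worker.c -- hypothetical extension registering the
>    * worker; load via shared_preload_libraries. */
>   #include "postgres.h"
>   #include "fmgr.h"
>   #include "postmaster/bgworker.h"
>
>   PG_MODULE_MAGIC;
>
>   PGDLLEXPORT void metrics_worker_main(Datum main_arg);
>
>   void
>   _PG_init(void)
>   {
>       BackgroundWorker worker;
>
>       memset(&worker, 0, sizeof(worker));
>       /* Shared memory access lets the worker read the stats data
>        * without going through the SQL layer at all. */
>       worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
>       worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
>       worker.bgw_restart_time = 10;   /* seconds, restart on crash */
>       snprintf(worker.bgw_name, BGW_MAXLEN, "metrics exporter");
>       snprintf(worker.bgw_library_name, BGW_MAXLEN, "metrics_worker");
>       snprintf(worker.bgw_function_name, BGW_MAXLEN,
>                "metrics_worker_main");
>       RegisterBackgroundWorker(&worker);
>   }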
>
> This would mean there would be a standard, official set of metrics
> available that a standard dashboard could rely on being present at
> any site, and it would stay reliable even when the SQL layer isn't
> functioning due to lack of connections, xid wraparound, or locking
> issues.
>
> The elephant in the room is that issue (3) requires a bit of sleight
> of hand. Ideally I would want it to be listening on the same ports as
> the database. That means having the postmaster recognize metric
> requests and hand them to the metrics background worker instead of a
> backend. I'm not sure people are going to be ok with that....
>
> For now my approach is to implement a background worker that listens
> on a new port and is basically its own small web server with shared
> memory access. This ignores issue (3), and my hope is that once we
> have some experience with this approach we'll see how reliable it is
> and how comfortable we are with the kind of hacking in the postmaster
> it would take to fix it. FWIW I do think this is an important issue
> and not one we can ignore indefinitely.
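>
> Roughly, the worker's main loop could look like this (very much a
> sketch: a hard-coded example port, no real HTTP handling, and
> format_metrics_from_shmem() is a made-up placeholder for walking the
> stats structs in shared memory):
>
>   /* Hypothetical main loop for the worker registered above
>    * (uses the shutdown-request helpers from recent releases). */
>   #include "postgres.h"
>   #include "postmaster/bgworker.h"
>   #include "postmaster/interrupt.h"
>
>   #include <netinet/in.h>
>   #include <sys/socket.h>
>   #include <unistd.h>
>
>   void
>   metrics_worker_main(Datum main_arg)
>   {
>       int                 listen_fd;
>       struct sockaddr_in  addr;
>
>       /* Standard bgworker boilerplate: install a SIGTERM handler,
>        * then unblock signals. */
>       pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
>       BackgroundWorkerUnblockSignals();
>
>       listen_fd = socket(AF_INET, SOCK_STREAM, 0);
>       memset(&addr, 0, sizeof(addr));
>       addr.sin_family = AF_INET;
>       addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
>       addr.sin_port = htons(9187);    /* arbitrary example port */
>       bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
>       listen(listen_fd, 5);
>
>       while (!ShutdownRequestPending)
>       {
>           int fd = accept(listen_fd, NULL, NULL);
>
>           if (fd < 0)
>               continue;   /* includes EINTR from SIGTERM */
>           /* format_metrics_from_shmem() is made up: it would read
>            * the stats directly and emit exposition-format lines,
>            * with no SQL anywhere in between. */
>           /* write(fd, format_metrics_from_shmem(), ...); */
>           close(fd);
>       }
>   }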
>
> There is another elephant in the room (it's a big room), which is
> that this all makes sense for stats data. It doesn't make much sense
> for data that currently lives in pg_class, pg_index, etc. In other
> words, I'm mostly solving issue (2) by ignoring it and concentrating
> on stats data.
>
> I haven't settled on a good solution for that data. I vaguely lean
> towards saying that the volatile metrics in those tables should
> really live in the stats data, or at least be mirrored there. That
> would make a clean distinction between what Postgres considers a
> metric and what it considers catalog data. But I'm not sure that will
> really work in practice. In particular, I think it's likely we'll
> need to get catalog data from every database anyway, for example to
> label things like tables with better labels than OIDs.
>
> This work is being funded by Aiven which is really interested in
> improving observability and integration between Postgres and other
> open source cloud software.
>
> --
> greg
