On Mon, Sep 14, 2015 at 3:03 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov
> <aekorot...@gmail.com> wrote:
> > In order to build consensus we need a roadmap for wait monitoring.
> > Would a single byte in PgBackendStatus be the only way of tracking wait
> > events? Could we have pluggable infrastructure in wait monitoring: for
> > instance, hooks for wait event begin and end?
> No, it's not the only way of doing it. I proposed doing that way
> because it's simple and cheap, but I'm not hell-bent on it. My basic
> concern here is about the cost of this. I think that the most data we
> can report without some kind of synchronization protocol is one 4-byte
> integer. If we want to report anything more than that, we're going to
> need something like the st_changecount protocol, or a lock, and that's
> going to add very significantly - and in my view unacceptably - to the
> cost. I care very much about having this facility be something that
> we can use in lots of places, even extremely frequent operations like
> buffer reads and contended lwlock acquisition.
Yes, the major question is cost. But I think we should validate our
assumptions by experiment, given that other synchronization protocols
are possible. Ildus posted an implementation of a double buffering
approach that showed quite low cost.
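
To make the cost question concrete, here is a minimal sketch in C11 atomics of the two reporting styles discussed above: a single 4-byte store with no protocol, and a changecount-style protocol for a larger payload. All names (WaitSlot, report_wait_event, report_wait_detail, read_wait_detail) and the payload fields are hypothetical. This is neither the actual st_changecount code nor Ildus's double buffering patch; it only illustrates where the extra counter bumps and barriers come from, assuming a single writer per slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-backend slot; not a real PostgreSQL structure. */
typedef struct WaitSlot
{
    _Atomic uint32_t wait_event;   /* cheap path: one 4-byte word */
    _Atomic uint32_t changecount;  /* odd while a detail write is in progress */
    _Atomic uint32_t rel;          /* detail payload (assumed fields) */
    _Atomic uint32_t block;
} WaitSlot;

/* Cheap path: a single aligned 4-byte store needs no protocol at all. */
static inline void
report_wait_event(WaitSlot *slot, uint32_t event)
{
    atomic_store_explicit(&slot->wait_event, event, memory_order_relaxed);
}

/*
 * Expensive path: bump the counter, write barrier, store the payload,
 * then bump the counter again with release semantics.
 * Assumes only the owning backend ever writes to its slot.
 */
static inline void
report_wait_detail(WaitSlot *slot, uint32_t rel, uint32_t block)
{
    uint32_t c = atomic_load_explicit(&slot->changecount, memory_order_relaxed);

    atomic_store_explicit(&slot->changecount, c + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);      /* write barrier */
    atomic_store_explicit(&slot->rel, rel, memory_order_relaxed);
    atomic_store_explicit(&slot->block, block, memory_order_relaxed);
    atomic_store_explicit(&slot->changecount, c + 2, memory_order_release);
}

/* Reader retries until it sees the same even counter before and after. */
static inline void
read_wait_detail(WaitSlot *slot, uint32_t *rel, uint32_t *block)
{
    for (;;)
    {
        uint32_t before = atomic_load_explicit(&slot->changecount,
                                               memory_order_acquire);

        *rel = atomic_load_explicit(&slot->rel, memory_order_relaxed);
        *block = atomic_load_explicit(&slot->block, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* read barrier */

        uint32_t after = atomic_load_explicit(&slot->changecount,
                                              memory_order_relaxed);

        if (before == after && (before & 1) == 0)
            return;                                 /* consistent snapshot */
    }
}
```

The writer side of the detailed path costs two extra stores and two barriers per wait, which is exactly the overhead that would need to be measured on hot paths.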
> I think that there may be some *kinds of waits* for which it's
> practical to report additional detail. For example, suppose that when
> a heavyweight lock wait first happens, we just report the lock type
> (relation, tuple, etc.) but then when the deadlock detector expires,
> if we're still waiting, we report the entire lock tag. Well, that's
> going to happen infrequently enough, and is expensive enough anyway,
> that the cost doesn't matter. But if, every time we read a disk
> block, we take a lock (or bump a changecount and do a write barrier),
> dump the whole block tag in there, release the lock (or do another
> write barrier and bump the changecount again) that sounds kind of
> expensive to me. Maybe we can prove that it doesn't matter on any
> workload, but I doubt it. We're fighting for every cycle in some of
> these code paths, and there's good evidence that we're burning too
> many of them compared to competing products already.
Yes, but some competing products also provide comprehensive wait
monitoring. That makes me think it should be possible for us too.
> I am not a big fan of hooks as a way of resolving disagreements about
> the design. We may find that there are places where it's useful to
> have hooks so that different extensions can do different things, and
> that is fine. But we shouldn't use that as a way of punting the
> difficult questions. There isn't enough common understanding here of
> what we're all trying to get done and why we're trying to do it in
> particular ways rather than in other ways to jump to the conclusion
> that a hook is the right answer. I'd prefer to have a nice, built-in
> system that everyone agrees represents a good set of trade-offs than
> an extensible system.
I think hooks could be justified not only by disagreements about design,
but by platform-dependent issues too.
The next step, once we have a view with current wait events, will be
gathering some statistics on them. We can contrast at least two approaches here:
1) Periodic sampling of current wait events.
2) Measuring the duration of each wait event. We could collect statistics
locally for a short period and update the shared memory structure periodically (using some
In the previous attempt to gather lwlock statistics, you predicted that
sampling could have significant overhead. In contrast, on many
systems time measurements are cheap. We have implemented both approaches,
and our results show that sampling every 1 millisecond produces higher
overhead than measuring the duration of each individual wait event. We can
share another version of wait monitoring, based on sampling, to make these
results reproducible for everybody. However, cheap time measurements are not
available on every platform. For instance, ISTM that on Windows time
measurements are too expensive.
That makes me think that we need a pluggable solution, at least for
statistics: direct measurement of event durations on the majority of
systems, and sampling on the others as the lesser evil.
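
As a rough illustration of approach 2 with a pluggable time source, here is a minimal sketch. Everything in it is an assumption made for the example: the names (WaitStats, wait_start, wait_end, flush_wait_stats, wait_clock), the array size, and the flush interval. The mechanism for updating shared memory is not spelled out above, so a plain unsynchronized copy into a static struct stands in for it here.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define N_WAIT_EVENTS 8     /* assumed number of distinct wait events */
#define FLUSH_EVERY   128   /* assumed flush interval, in completed waits */

typedef struct WaitStats
{
    uint64_t count[N_WAIT_EVENTS];
    uint64_t total_ns[N_WAIT_EVENTS];
} WaitStats;

/* The time source is a function pointer so a platform can swap it out. */
typedef uint64_t (*wait_clock_fn) (void);

/* Default clock: C11 timespec_get (wall clock, so it is not monotonic). */
static uint64_t
system_clock_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t) ts.tv_sec * UINT64_C(1000000000) + (uint64_t) ts.tv_nsec;
}

/* Manually advanced clock, handy for deterministic experiments. */
static uint64_t manual_now_ns;
static uint64_t manual_clock_ns(void) { return manual_now_ns; }

static wait_clock_fn wait_clock = system_clock_ns;
static WaitStats local_stats;   /* backend-local accumulator */
static WaitStats shared_stats;  /* stand-in for a shared memory structure */
static uint32_t pending;        /* completed waits since the last flush */
static uint32_t cur_event;
static uint64_t cur_start_ns;

/* Fold local counters into the shared copy and reset the local ones. */
static void
flush_wait_stats(void)
{
    for (int i = 0; i < N_WAIT_EVENTS; i++)
    {
        shared_stats.count[i] += local_stats.count[i];
        shared_stats.total_ns[i] += local_stats.total_ns[i];
        local_stats.count[i] = 0;
        local_stats.total_ns[i] = 0;
    }
    pending = 0;
}

static void
wait_start(uint32_t event)
{
    /* Out-of-range IDs are folded into slot 0 to keep the sketch safe. */
    cur_event = (event < N_WAIT_EVENTS) ? event : 0;
    cur_start_ns = wait_clock();
}

static void
wait_end(void)
{
    local_stats.count[cur_event]++;
    local_stats.total_ns[cur_event] += wait_clock() - cur_start_ns;
    if (++pending >= FLUSH_EVERY)
        flush_wait_stats();
}
```

On a platform where reading the clock is too expensive, wait_clock is the natural place to plug in something cheaper, or to disable duration measurement and fall back to sampling.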
> I think it's reasonable to consider reporting this data in the PGPROC
> using a 4-byte integer rather than reporting it through a single byte
> in the backend status structure. I believe that addresses the
> concerns about reporting from auxiliary processes, and it also allows
> a little more data to be reported. For anything in excess of that, I
> think we should think rather harder. Most likely, such additional
> detail should be reported only for certain types of wait events, or on
> a delay, or something like that, so that the core mechanism remains
> really, really fast.
That sounds reasonable. There are many pending questions, but it seems like
a step forward to me.
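
As an illustration of how much fits into one 4-byte integer, here is a small sketch that packs a wait class and an event ID into a single word. The split into an 8-bit class and a 24-bit event ID, the class constants, and the function names are all assumptions made for this example; nothing in this thread fixes the layout.

```c
#include <assert.h>
#include <stdint.h>

#define WAIT_CLASS_SHIFT 24
#define WAIT_EVENT_MASK  UINT32_C(0x00FFFFFF)

/* Hypothetical wait classes, for illustration only. */
enum
{
    WAIT_CLASS_NONE = 0,
    WAIT_CLASS_LWLOCK = 1,
    WAIT_CLASS_LOCK = 2,
    WAIT_CLASS_BUFFER_IO = 3
};

/* Pack class (top 8 bits) and event ID (low 24 bits) into one word. */
static inline uint32_t
wait_info_pack(uint8_t wait_class, uint32_t event_id)
{
    return ((uint32_t) wait_class << WAIT_CLASS_SHIFT) |
           (event_id & WAIT_EVENT_MASK);
}

static inline uint8_t
wait_info_class(uint32_t info)
{
    return (uint8_t) (info >> WAIT_CLASS_SHIFT);
}

static inline uint32_t
wait_info_id(uint32_t info)
{
    return info & WAIT_EVENT_MASK;
}
```

With such a layout a heavyweight lock wait could carry the lock type in the event ID, while anything larger, such as a full lock tag, would still need a separate and slower path.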
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company