On Mon, Dec 12, 2016 at 12:16 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 12 December 2016 at 16:52, Robert Haas <robertmh...@gmail.com> wrote:
>> On Mon, Dec 12, 2016 at 11:33 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>>> Last week I noticed that the Wait Event/Locks system doesn't correctly
>>> describe waits for tuple locks because in some cases that happens in
>>> two stages.
>>
>> Well, I replied to that email to say that I didn't agree with your
>> analysis. I think if something happens in two stages, those wait
>> events should be distinguished. The whole point here is to get
>> clarity on what the system is waiting for, and we lose that if we
>> start trying to merge together things which are at the code level
>> separate.
>
> Clarity is what we are both looking for then.
Granted.

> I know I am waiting for a tuple lock. You want information about all
> the lower levels. I'm good with that as long as the lower information
> is somehow recorded against the higher level task, which it wouldn't
> be in either of the cases I mention, hence why I bring it up again.

So, I think that this may be a case where I built an apple and you are complaining that it's not an orange. I had very clearly in mind from the beginning of the wait event work that we were trying to expose low-level information about what the system was doing, and I advocated for this design as a way of doing that, I think, reasonably well. The statement that you want information about what is going on at a higher level is fair, but IMHO it's NOT fair to present that as a defect in what's been committed. It was never intended to do that, at least not by me, and I committed all of the relevant patches and had a fair amount of involvement with the design. You may think I should have been trying to solve a different problem, and you may even be right, but that is a separate issue from how well I did at solving the problem I was actually attempting to solve.

There was quite a lot of discussion 9-12 months ago (IIRC) about wanting additional detail to be associated with wait events. From what I understand, Oracle will not only report that it waited for a block to be read but also tell you which block it was waiting for, and some of the folks at Postgres Pro were advocating for the wait event facility to do something similar. I strongly resisted that kind of additional detail, because what makes the current system fast and low-impact, and therefore able to be on by default, is that all it does is one unsynchronized 4-byte write into shared memory.
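For the sake of illustration, that single-write scheme amounts to something like the sketch below. To be clear, this is not PostgreSQL's actual code: the struct, function names, and class values here are all hypothetical, chosen just to show the shape of the mechanism.

```c
#include <stdint.h>

/* Hypothetical wait-event class bytes, packed into the top 8 bits of the
 * 32-bit value.  The names and numbers are illustrative only. */
#define WAIT_CLASS_LWLOCK 0x01000000U
#define WAIT_CLASS_LOCK   0x03000000U

/* Pack a class byte and a 24-bit event id into a single uint32. */
static inline uint32_t
make_wait_event_info(uint32_t classId, uint32_t eventId)
{
    return classId | (eventId & 0x00FFFFFFU);
}

/* Per-backend slot in shared memory: a single aligned 4-byte field.
 * On mainstream platforms an aligned 32-bit store cannot tear, so a
 * monitoring process reading this concurrently sees either the old
 * value or the new one -- never a mix -- with no memory barriers on
 * the update path. */
typedef struct BackendStatus
{
    volatile uint32_t wait_event_info;
} BackendStatus;

static inline void
report_wait_start(BackendStatus *st, uint32_t wait_event_info)
{
    st->wait_event_info = wait_event_info;  /* the one unsynchronized write */
}

static inline void
report_wait_end(BackendStatus *st)
{
    st->wait_event_info = 0;                /* 0 = not waiting */
}
```

Anything wider than that single field -- say, a relfilenode stored alongside the event -- would require barriers or a seqlock-style protocol so a reader could not observe a half-updated pair.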
If we do anything more than that -- say 8 bytes, let alone the extra 20 bytes we'd need to store a relfilenode -- we're going to need to insert memory barriers in the path that updates the data in order to make sure that it can be read without tearing, and I'm afraid that's going to have a noticeable performance impact. Certainly, we'd need to check into that very carefully before doing it. Operations like reading a block or blocking on an LWLock are heavier than a couple of memory barriers, but they're not necessarily so much heavier that we can afford to throw extra memory barriers into those paths without any impact.

Now, some of what you want to do here may be achievable without making wait_event_info any wider than uint32, and to the extent that's possible without too much contortion I am fine with it. If you want to know that a tuple lock was being sought for an update rather than a delete, that could probably be exposed. But if you want to know WHICH tuple or even WHICH relation was affected, this mechanism isn't well-suited to that task. I think we may well want to add some new mechanism that reports those sorts of things, but THIS mechanism doesn't have the bit-space for it and isn't designed to do it. It's designed to give basic information and be so cheap that we can use it practically everywhere. For more detailed reporting, we should probably have facilities that are not turned on by default, or else facilities that are limited to cases where the volume can never be very high. You don't have to add a lot of overhead to cause a problem in a code path that executes tens of thousands of times per second per backend.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers