On Fri, Jul 14, 2017 at 5:06 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> I think that what this probably comes down to, more than anything
> else, is that you have leftmost hot/bloated leaf pages like this:
>
>           idx           | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
> -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
> ...
>  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
>  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
> ...
>
> The high key for the leftmost shows that only values below 0x11 belong
> on the first page. This is about 16 or 17 possible distinct values,
> and yet the page has 100 live items, and 41 dead items; in total,
> there is room for 367 items. It's only slightly better with other
> nearby pages. This is especially bad because once the keyspace gets
> split up this finely, it's *impossible* to reverse it -- it's more or
> less a permanent problem, at least until a REINDEX.
I've been thinking about this a lot, because this really does look
like a pathological case to me. I think that this workload is very
sensitive to how effective kill_prior_tuple/LP_DEAD hinting is. Or at
least, I can imagine that mechanism doing a lot better than it
actually manages to do here.

I wonder if commit 2ed5b87f9, which let index scans using MVCC
snapshots avoid holding on to a pin on leaf pages, should have
considered workloads like this. The whole point of that commit was to
reduce the blocking of VACUUM by index scans, and I think that it was
effective in doing that in many important cases. My concern is that
the commit added this, which could make LP_DEAD hinting occur less
frequently:

+ * If the pin was released after reading the page, then we re-read it.  If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
  */
 void
-_bt_killitems(IndexScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)

Even if I'm basically right about this making LP_DEAD hinting occur
less frequently in cases like the one from Alik, the commit might
still have been worth it, even from the point of view of index bloat,
because VACUUM operations complete more frequently, and so there is
an indirect boost. It's not as if we've been that great at assessing
problems like this before now, so it bears considering whether we
could do better here without introducing much additional complexity.
I just don't know.

I would like to hear opinions on the following:

* How much of a difference might this have actually made?

* If it's a significant factor for any important workload, what can
  be done about it?

* Does anyone have any ideas about a less restrictive strategy for
  avoiding spuriously killing an index tuple whose heap TID was just
  recycled?

ISTM that by doing this LSN check, _bt_killitems() is least likely to
work when and where it is most needed. The most obvious way to
correctly proceed even when the LSN has changed would be to revisit
the heap page when the leaf page LSN check doesn't work out, and to
revalidate the safety of killing the tuples after all -- at least
with unique indexes. The problem that we must avoid here is probably
extremely unlikely to actually be observed in the real world, so it
seems likely that an approach like this would be worth it.

--
Peter Geoghegan
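
PS: To make the mechanism concrete, the guard in question now works
roughly like this. Note that this is a paraphrased sketch from
memory, not the verbatim nbtutils.c code -- the actual kill loop, and
the handling of a page that has since been deleted outright, are
elided:

void
_bt_killitems(IndexScanDesc scan)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    Page        page;

    if (BTScanPosIsPinned(so->currPos))
    {
        /*
         * We held the pin continuously since reading the tuples, so
         * VACUUM cannot have recycled any TID on the page.  Just take
         * a read lock and proceed -- no LSN check needed.
         */
        LockBuffer(so->currPos.buf, BT_READ);
        page = BufferGetPage(so->currPos.buf);
    }
    else
    {
        /* Pin was dropped (MVCC snapshot); re-read the leaf page */
        Buffer      buf = _bt_getbuf(scan->indexRelation,
                                     so->currPos.currPage, BT_READ);

        page = BufferGetPage(buf);
        if (PageGetLSN(page) != so->currPos.lsn)
        {
            /*
             * The page was modified while we weren't holding a pin,
             * so a TID might have been recycled for an unrelated heap
             * tuple.  Give up without setting any LP_DEAD bits.  This
             * unconditional bail-out is the part I'm concerned about.
             */
            _bt_relbuf(scan->indexRelation, buf);
            return;
        }
        so->currPos.buf = buf;
    }

    /* ... walk so->killedItems[] and set LP_DEAD where still safe ... */
}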
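
Strictly as a straw man, and not something I've tried: at the point
where the sketch above gives up, we could instead ask the heap
directly whether the remembered TID still has anything at it.
heap_hot_search() is the existing heapam helper that the
kill_prior_tuple machinery already leans on; everything else about
this is hypothetical:

        /*
         * LSN changed: we can no longer assume that kitem->heapTid
         * wasn't recycled.  If SnapshotAny finds nothing at all at
         * that TID, the old entry must have been vacuumed away and
         * the line pointer not yet reused, so setting LP_DEAD on an
         * index item that points at it remains safe.
         */
        if (!heap_hot_search(&kitem->heapTid, scan->heapRelation,
                             SnapshotAny, NULL))
        {
            /* TID still points to nothing -- proceed with the kill */
        }
        else
        {
            /* TID may have been recycled -- skip this item */
        }

This is stricter than what I suggested above for unique indexes, and
lock ordering would need care, since we'd be share-locking a heap
page while still holding the leaf page lock. But something along
these lines might let _bt_killitems() do useful work in exactly the
cases where it currently gives up.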