On Fri, Jul 14, 2017 at 5:06 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> I think that what this probably comes down to, more than anything
> else, is that you have leftmost hot/bloated leaf pages like this:
>           idx          | level | l_item | blkno | btpo_prev |
> btpo_next | btpo_flags | type | live_items | dead_items |
> avg_item_size | page_size | free_size |         highkey
> -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
>  ...
>  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |
> 2751 |         65 | l    |        100 |         41 |            16 |
>    8192 |      5328 | 11 00 00 00 00 00 00 00
>  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |
> 2746 |         65 | l    |         48 |         90 |            16 |
>    8192 |      5388 | 32 00 00 00 00 00 00 00
> ...
> The high key for the leftmost shows that only values below 0x11 belong
> on the first page. This is about 16 or 17 possible distinct values,
> and yet the page has 100 live items, and 41 dead items; in total,
> there is room for 367 items. It's only slightly better with other
> nearby pages. This is especially bad because once the keyspace gets
> split up this finely, it's *impossible* to reverse it -- it's more or
> less a permanent problem, at least until a REINDEX.

I've been thinking about this a lot, because this really does look
like a pathological case to me. I think that this workload is very
sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
least, I can imagine that mechanism doing a lot better than it
actually manages to do here. I wonder if it's possible that commit
2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
pages, should have considered workloads like this.

The whole point of that commit was to reduce blocking by VACUUM on
index scans, and I think that it was effective in doing that in many
important cases. My concern is that the commit added this, which could
make LP_DEAD hinting occur less frequently:

+ * If the pin was released after reading the page, then we re-read it.  If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
-_bt_killitems(IndexAnd, ScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)

Even if I'm basically right about this making LP_DEAD hinting occur
less frequently in cases like the one from Alik, it might have
actually been worth it even from the point of view of index bloat,
because VACUUM operations complete more frequently, and so there is an
indirect boost. It's not as if we've been that great at assessing
problems like this before now, and so it bears considering if we could
do better here without introducing much additional complexity. I just
don't know.

I would like to hear opinions on:

How much of a difference might this have actually made?

If it's a significant factor for any important workload, what can be
done about it? Does anyone have any ideas about a less restrictive
strategy for avoiding spuriously killing an index tuple whose heap TID
was just recycled?

ISTM that by doing this LSN check, _bt_killitems() is least likely to
work when and where it is most needed.

The most obvious way to correctly proceed even when LSN has changed
would be to revisit the heap page when the leaf page LSN check doesn't
work out, and revalidate the safety of proceeding with killing tuples.
At least with unique indexes. It's probably extremely unlikely that
the problem that we must avoid here will actually be observed in the
real world, so it seems likely that an approach like this will be
worth it.

Peter Geoghegan

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to