On 2017-09-22 11:21, Masahiko Sawada wrote:
On Fri, Sep 22, 2017 at 4:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
At Fri, 22 Sep 2017 15:00:20 +0900, Masahiko Sawada <sawada.m...@gmail.com> wrote in <cad21aod6zgb1w6ps1axj0ccab_chdyiitntedpmhkefgg13...@mail.gmail.com>
On Tue, Sep 19, 2017 at 3:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> I was just looking at this thread, since it has been left alone for a
> long time in the CF app.
> At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan <p...@bowt.ie> wrote in 
>> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund <and...@anarazel.de> wrote:
>> > Hi,
>> >
>> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas <robertmh...@gmail.com> 
>> >> [ lots of valuable discussion ]
>> >
>> > I think this patch clearly still is in the design stage, and has
>> > received plenty feedback this CF.  I'll therefore move this to the next
>> > commitfest.
>> Does anyone have ideas on a way forward here? I don't, but then I
>> haven't thought about it in detail in several months.
> Is the additional storage in the metapage to store the current status
> of vacuum still unacceptable, even if it can avoid a useless
> full-page scan on indexes, especially for stable tables?
> Or, how about an additional bit in pg_stat_*_index to indicate
> that the index *doesn't* require the vacuum cleanup stage? (The
> default value causes cleanup.)

Did you mean that "the next cycle" is the lazy_cleanup_index() function
called by lazy_scan_heap()?

Both finally call btvacuumscan under certain conditions, but
what I meant by "the next cycle" is the lazy_cleanup_index call
in the next round of vacuum, since the abstraction layer (index AM)
isn't conscious of the details of btree.

> index_bulk_delete (or ambulkdelete) returns the flag in
> IndexBulkDeleteResult, then lazy_scan_heap stores the flag in
> stats, and in the next cycle it is looked up to decide the
> necessity of index cleanup.

Could you elaborate on this? For example, in a btree index, the
cleanup skips the index scan if index_bulk_delete has been called
during vacuuming, because stats != NULL. So I think we don't need
such a flag.

The flag works so that two successive index full scans don't
happen in one vacuum round. If any rows were deleted, the
btvacuumcleanup that follows does nothing.
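The stats-as-a-flag control flow can be sketched with a standalone toy model (the names and types below are simplified stand-ins, not the real PostgreSQL definitions): a non-NULL IndexBulkDeleteResult left behind by ambulkdelete lets amvacuumcleanup skip its own full scan, while an insert-only table, where ambulkdelete never runs, still pays one full scan in cleanup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct IndexBulkDeleteResult
{
	long		num_index_tuples;
} IndexBulkDeleteResult;

static int	full_scans = 0;		/* counts simulated index full scans */

/* model of ambulkdelete: runs only when the heap had dead tuples; its
 * non-NULL result later tells cleanup that a scan already happened */
static IndexBulkDeleteResult *
model_bulkdelete(bool have_dead_tuples, IndexBulkDeleteResult *stats)
{
	if (!have_dead_tuples)
		return stats;			/* lazy_vacuum_index not called at all */
	if (stats == NULL)
		stats = calloc(1, sizeof(IndexBulkDeleteResult));
	full_scans++;				/* bulkdelete scans the whole index */
	return stats;
}

/* model of amvacuumcleanup: scans the index only when bulkdelete
 * did not run, i.e. when stats is still NULL */
static IndexBulkDeleteResult *
model_vacuumcleanup(IndexBulkDeleteResult *stats)
{
	if (stats == NULL)
	{
		stats = calloc(1, sizeof(IndexBulkDeleteResult));
		full_scans++;			/* cleanup falls back to a full scan */
	}
	return stats;
}

/* one vacuum round; returns how many index full scans it caused */
static int
vacuum_round(bool have_dead_tuples)
{
	IndexBulkDeleteResult *stats = NULL;

	full_scans = 0;
	stats = model_bulkdelete(have_dead_tuples, stats);
	stats = model_vacuumcleanup(stats);
	free(stats);
	return full_scans;
}
```

Either way the round costs exactly one full scan in this model, which is the point: even a table with nothing to clean still triggers a scan in cleanup.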

I think what you wanted to solve here was the problem that
index_vacuum_cleanup runs a full scan even if it ends with no
actual work, during manual or anti-wraparound vacuums.  (I'm
getting a bit confused on this..) It is caused by using the
pointer "stats" as the flag to instruct it to do that. If the
stats-as-a-flag worked as expected, the GUC doesn't seem to be
needed.

Hmm, my proposal is that if a table hasn't changed much since the
previous vacuum, we skip cleaning up the index.

If the table has at least one garbage tuple, we do lazy_vacuum_index,
and the IndexBulkDeleteResult is stored, which causes the scan in
btvacuumcleanup to be skipped. On the other hand, if the table doesn't
have any garbage but some new tuples were inserted since the previous
vacuum, we don't do lazy_vacuum_index but do lazy_cleanup_index. In
this case, we always do lazy_cleanup_index (i.e., we do the full scan)
even if only one tuple was inserted. That's why I proposed a new GUC
parameter which allows us to skip lazy_cleanup_index in that case.
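A minimal sketch of that decision, under assumed names (the GUC name and the helper below are invented for illustration; the actual proposal only describes the behavior): skip lazy_cleanup_index when no dead tuples were found and the insert churn since the previous vacuum stays under a configurable fraction of the table.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical GUC: skip index cleanup when the table changed by less
 * than this fraction since the previous vacuum (name invented here) */
static double vacuum_index_cleanup_scale = 0.01;

/* Decide whether lazy_cleanup_index can be skipped for this round.
 * had_dead_tuples: lazy_vacuum_index already ran, so cleanup will see
 * stats != NULL and skip its own scan anyway -- nothing to save.
 * inserted_since_last: tuples inserted since the previous vacuum. */
static bool
skip_index_cleanup(bool had_dead_tuples,
				   double inserted_since_last, double reltuples)
{
	if (had_dead_tuples)
		return false;			/* cleanup is already cheap here */
	if (reltuples <= 0)
		return false;			/* no baseline; be conservative */
	return inserted_since_last / reltuples < vacuum_index_cleanup_scale;
}
```

With the default of 0.01, a million-row table that gained only a handful of tuples would skip the full-index scan, while one that gained tens of thousands would not.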

In addition to that, as Simon and Peter pointed out,
index_bulk_delete can leave not-fully-removed pages (so-called
half-dead pages, and pages that are recyclable but not registered
in the FSM, AFAICS) in some cases, mainly because of the
RecentGlobalXmin interlock. In that case, just inhibiting the
cleanup scan by a threshold lets such dangling pages persist in
the index. (I couldn't produce that many dangling pages, though..)

The first patch in the mail (*1) does that. It seems to have some
bugs, though..

Since the dangling pages persist until autovacuum decides to scan
the table again, we should run a vacuum round (or
index_vacuum_cleanup itself) even with no dead rows if we want
to clean up such pages within a certain period. The second patch
does that.

IIUC half-dead pages are not relevant to this proposal. The proposal
has two problems:
* By skipping index cleanup we could leave recyclable pages that are
not marked as recyclable.
* We stash an XID when a btree page is deleted, which is used to
determine when it's finally safe to recycle the page.
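That second point can be modeled in a few lines (a deliberately simplified sketch: real XID comparisons must handle wraparound, and the actual code compares the stored XID against RecentGlobalXmin): a deleted page remembers the next XID at deletion time, and becomes recyclable only once that XID falls behind the oldest-xmin horizon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy model, ignoring XID wraparound: the page stored the next XID
 * when it was deleted; once no running transaction could still hold a
 * link to the page (the stored XID is older than the oldest xmin
 * horizon), it is safe to put the page in the free space map. */
static bool
page_safe_to_recycle(TransactionId page_deleted_xact,
					 TransactionId oldest_xmin)
{
	return page_deleted_xact < oldest_xmin;
}
```

This is why skipping cleanup scans can strand pages: a page deleted just before a long-running transaction started stays unrecyclable until some later scan re-checks it against a newer horizon.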


Masahiko Sawada
NTT Open Source Software Center

Here is a small patch that skips scanning a btree index if no pending
deleted pages exist.

It detects this situation by comparing pages_deleted with pages_free.
If they are equal, then there are no half-deleted pages, and it is
safe to skip the next lazy scan.
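The counter comparison can be sketched as a standalone toy model (an illustrative simplification of the counters btvacuumscan maintains, not the real accounting): every deleted page counts in pages_deleted, but only the recyclable ones count in pages_free, so the two match exactly when nothing is left pending.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum PageState
{
	PAGE_LIVE,					/* ordinary in-use page */
	PAGE_DELETED_PENDING,		/* deleted, but not yet recyclable */
	PAGE_DELETED_FREE			/* deleted and safe to recycle */
} PageState;

/* Toy model of the detection: walk the pages counting pages_deleted
 * and pages_free; equal counters mean no deleted page is pending */
static bool
no_pending_deleted_pages(const PageState *pages, int npages)
{
	int			pages_deleted = 0;
	int			pages_free = 0;
	int			i;

	for (i = 0; i < npages; i++)
	{
		if (pages[i] == PAGE_DELETED_PENDING ||
			pages[i] == PAGE_DELETED_FREE)
			pages_deleted++;
		if (pages[i] == PAGE_DELETED_FREE)
			pages_free++;
	}
	return pages_deleted == pages_free;
}
```

A single pending page makes the counters diverge, so the flag stays unset and the next cleanup scan runs.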

The flag is stored in btpo_flags. It is cleared using generic WAL
before a scan. If no half-deleted pages are found, it is set without
WAL (like a hint bit). (It is safe to miss setting the flag, but it
is not safe to miss clearing it.)

This patch works correctly:
- if rows are only inserted and never deleted, the index always skips
scanning (starting with the second scan).
- if some rows are updated/deleted, then some scans are not skipped.
But once all half-deleted pages are marked as deleted, lazy scans
start to be skipped.

Open question: what to do with index statistics? For simplicity this
patch skips updating stats (it just returns NULL from btvacuumcleanup).
Probably, it should detect the proportion of table changes, and not
skip scans if the table grows too much.

Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
From 5f7f04883b8769f424ff98aff21689e640eb3d94 Mon Sep 17 00:00:00 2001
From: Sokolov Yura <funny.fal...@postgrespro.ru>
Date: Tue, 1 Aug 2017 15:57:40 +0300
Subject: [PATCH] lazy btree scan

 src/backend/access/nbtree/nbtree.c | 59 ++++++++++++++++++++++++++++++++++++++
 src/include/access/nbtree.h        |  2 ++
 2 files changed, 61 insertions(+)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3dbafdd6fc..4216dbe807 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
+#include "access/generic_xlog.h"
 #include "access/nbtree.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
@@ -937,6 +938,56 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	return stats;
 }
 
+static bool
+btneednocleanup(Relation rel)
+{
+	Buffer		metabuf;
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	bool		need_no_cleanup;
+
+	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	need_no_cleanup = P_NEED_NO_CLEANUP(metaopaque);
+	_bt_relbuf(rel, metabuf);
+	return need_no_cleanup;
+}
+
+static void
+btsetneednocleanup(Relation rel, bool need_no_cleanup)
+{
+	Buffer		metabuf;
+	Page		metapg;
+	BTPageOpaque metaopaque;
+
+	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	if (need_no_cleanup && !P_NEED_NO_CLEANUP(metaopaque))
+	{
+		metaopaque->btpo_flags |= BTP_NEED_NO_CLEANUP;
+		MarkBufferDirtyHint(metabuf, true);
+	}
+	else if (!need_no_cleanup && P_NEED_NO_CLEANUP(metaopaque))
+	{
+		GenericXLogState *state;
+		Page		pg;
+		BTPageOpaque opaque;
+		XLogRecPtr	recptr;
+
+		state = GenericXLogStart(rel);
+		pg = GenericXLogRegisterBuffer(state, metabuf, 0);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(pg);
+		opaque->btpo_flags &= ~BTP_NEED_NO_CLEANUP;
+		recptr = GenericXLogFinish(state);
+		PageSetLSN(metapg, recptr);
+		MarkBufferDirty(metabuf);
+	}
+	_bt_relbuf(rel, metabuf);
+}
+
 /*
  * Post-VACUUM cleanup.
@@ -960,6 +1011,8 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	if (stats == NULL)
 	{
+		if (btneednocleanup(info->index))
+			return NULL;
+
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 		btvacuumscan(info, stats, NULL, NULL, 0);
@@ -1028,6 +1081,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+	/* Reset "need no cleanup" flag */
+	btsetneednocleanup(rel, false);
+
 	/*
 	 * The outer loop iterates over all index pages except the metapage, in
 	 * physical order (we hope the kernel will cooperate in providing
@@ -1111,6 +1167,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	if (stats->pages_deleted == stats->pages_free)
+		btsetneednocleanup(rel, true);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e6abbec280..69d34cbc71 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
 #define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
 #define BTP_INCOMPLETE_SPLIT (1 << 7)	/* right sibling's downlink is missing */
+#define BTP_NEED_NO_CLEANUP (1 << 14)	/* no pending deleted pages, so cleanup can be skipped */
 
 /*
  * The max allowed value of a cycle ID is a bit less than 64K.  This is
@@ -181,6 +182,7 @@ typedef struct BTMetaPageData
 #define P_IGNORE(opaque)		((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque)	((opaque)->btpo_flags & BTP_HAS_GARBAGE)
 #define P_INCOMPLETE_SPLIT(opaque)	((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_NEED_NO_CLEANUP(opaque)	(((opaque)->btpo_flags & BTP_NEED_NO_CLEANUP) != 0)
 
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)