Re: Index bloat and REINDEX/VACUUM optimization for partial index

Jeff Janes Tue, 29 Aug 2023 12:43:35 -0700

On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k <[email protected]>
wrote:


> Hi,
>
> TL;DR:
> Observations:
>
>    1. REINDEX requires a full table scan
>       - Roughly create a new index, rename index, drop old index.
>       - REINDEX is not incremental. running reindex frequently does not
>       reduce the future reindex time.
>    2. REINDEX does not use the index itself
>    3. VACUUM does not clean up the indices. (relpages >> reltuples) I
>    understand, vacuum is supposed to remove pages only if there are no live
>    tuples in the page, but somehow, even immediately after vacuum, I see
>    relpages significantly greater than reltuples. I would have assumed,
>    relpages <= reltuples
>    4. Query Planner does not consider index bloat, so uses highly bloated
>    partial index that is terribly slow over other index
>
> Your points 3 and 4 are not correct.  empty index pages are put on a
freelist for future reuse, they are not physically removed from the
underlying index files.  Maybe they are not actually getting put on the
freelist or not being reused from the freelist for some reason, but that
would be a different issue.  Use the extension pgstattuple to see what its
function pgstatindex says about the index.

The planner does take index bloat into consideration, but its effect size
is low.  Which it should be, as empty or irrelevant pages should be
efficiently skipped during the course of most index operations. To figure
out what is going with your queries, you should do an EXPLAIN (ANALYZE,
BUFFERS) of them, but with it being slow and with it being fast.


> Question: Is there a way to optimize postgres vacuum/reindex when using
> partial indexes?
>

Without knowing what is actually going wrong, I can only offer
generalities.  Make sure you don't have long-lived transactions which
prevent efficient clean up.  Increase the frequency on which vacuum runs on
the table.  It can't reduce the size of an already bloated index, but by
keeping the freelist stocked it should be able prevent it from getting
bloated in the first place.  Also, it can remove empty pages from being
linked into the index tree structure, which means they won't need to be
scanned even though they are still in the file.  It can also free up space
inside non-empty pages for future reuse within that same page, and so that
index tuples don't need to be chased down in the table only to be found to
be not visible.


> ```
> SELECT [columns list]
>   FROM tasks
>   WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND
> updated < NOW() - interval '30 minutes'
> ```
>
> Since we are only interested in the pending tasks, I created a partial
> index
>  `*"tasks_pending_status_created_type_idx" btree (status, created,
> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>

This looks like a poorly designed index.  Since the status condition
exactly matches the index where clause, there is no residual point in
having "status" be the first column in the index, it can only get in the
way (for this particular query).  Move it to the end, or remove it
altogether.

Within the tuples which pass the status check, which inequality is more
selective, the "created" one or "updated" one?

Cheers,

Jeff

Re: Index bloat and REINDEX/VACUUM optimization for partial index

Reply via email to