Use of additional index columns in rows filtering

Maxim Ivanov Wed, 15 Feb 2023 00:58:01 -0800

Hi All,

I'd like to report what seems to be a missing optimization opportunity or 
understand why it is not possible to achieve.


TLDR; additional index column B specified in CREATE INDEX .. (A) INCLUDE(B) is 
not used to filter rows in queries like WHERE B = $1 ORDER BY A during 
IndexScan. https://dbfiddle.uk/iehtq44L


Take following database:


  CREATE TABLE t(
    a integer NOT NULL,
    b integer NOT NULL,
    d integer NOT NULL
  );

  CREATE UNIQUE INDEX t_a_include_b ON t (a) INCLUDE (b);
  -- I'd expect index above to behave as index below for the purpose
  -- of this query
  -- CREATE UNIQUE INDEX ON t(a,b);

  INSERT INTO t(
    SELECT random() * 100000000 as a,
           random() * 3 as b,
           generate_series as d FROM generate_series(1,200000)
  ) ON CONFLICT DO NOTHING;


If we filter on `a` and `b` columns while scanning index created as `(a) 
INCLUDE (b)` it seems to be fetching tuples from heap to check for condition `b 
= 4` despite both columns available in the index:

    SELECT * FROM  t WHERE a > 1000000 and b = 4 ORDER BY a ASC LIMIT 10;


Here is the plan (notice high "shared hit"):

  Limit  (cost=0.42..10955.01 rows=1 width=12) (actual time=84.283..84.284 
rows=0 loops=1)
    Output: a, b, d
    Buffers: shared hit=198307
    ->  Index Scan using t_a_include_b on public.t (cost=0.42..10955.01 rows=1 
width=12) (actual time=84.280..84.281 rows=0 loops=1)
          Output: a, b, d
          Index Cond: (t.a > 1000000)
          Filter: (t.b = 4)
          Rows Removed by Filter: 197805
          Buffers: shared hit=198307
  Planning:
    Buffers: shared hit=30
  Planning Time: 0.201 ms
  Execution Time: 84.303 ms


And here is the plan with index on (a,b).

  Limit  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 
loops=1)
    Output: a, b, d
    Buffers: shared hit=613
    ->  Index Scan using t_a_b_idx on public.t  (cost=0.42..4447.90 rows=1 
width=12) (actual time=6.880..6.881 rows=0 loops=1)
          Output: a, b, d
          Index Cond: ((t.a > 1000000) AND (t.b = 4))
          Buffers: shared hit=613
  Planning:
    Buffers: shared hit=41
  Planning Time: 0.314 ms
  Execution Time: 6.910 ms


Because query doesn't sort on `b`, only filters on it while sorting on `a`, I'd 
expect indexes `(a) INCLUDE (b)` and `(a,b)` behave exactly the same with this 
particular query.

Interestingly, IndexOnlyScan is capable of using additional columns to filter 
rows without fetching them from the heap, but only for visibible tuples:

  VACUUM FREEZE t;
  SELECT a,b FROM  t WHERE a > 1000000 and b = 4 ORDER BY a ASC LIMIT 10;

  Limit  (cost=0.42..6619.76 rows=1 width=8) (actual time=18.479..18.480 rows=0 
loops=1)
    Output: a, b
    Buffers: shared hit=662
    ->  Index Only Scan using t_a_include_b on public.t  (cost=0.42..6619.76 
rows=1 width=8) (actual time=18.477..18.477 rows=0 loops=1)
          Output: a, b
          Index Cond: (t.a > 1000000)
          Filter: (t.b = 4)
          Rows Removed by Filter: 197771
          Heap Fetches: 0
          Buffers: shared hit=662

Removing VACUUM makes it behave like IndexScan and fetch candidate tuples from 
heap all while returning zero rows in the result.


To make query plan comparable I had to force index scan on both with:

  SET enable_bitmapscan to off;
  SET enable_seqscan to off;
  SET max_parallel_workers_per_gather = 0;

Self contained fully reproducible example is in https://dbfiddle.uk/iehtq44L

Regards,
Maxim

Use of additional index columns in rows filtering

Reply via email to