On 01/12/2017 20:34, Justin Pryzby wrote:
On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
We recently had an issue in production, where a bitmap scan was chosen
instead of an index scan. Despite being 30x slower, the bitmap scan had
about the same cost as the index scan.
Me too, see also:

drop table if exists aaa;
create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
generate_series(1, 10000000) id;
create index i1 on aaa  (num);
create index i2 on aaa  (flag);
analyze aaa;

select relname, reltuples::bigint, relpages::bigint,
(reltuples/relpages)::bigint tpp from pg_class where relname
in('aaa','i1','i2') order by relname;

The query was:
explain (analyze,verbose,costs,buffers)
select count(*) from aaa where num = 1 and flag = true;
Note that id%100==1 implies flag='t', so the planner anticipates retrieving
fewer rows than it will ultimately read, probably by 2x.  It makes sense that
causes the index scan to be more expensive than expected, but that's only
somewhat important, since there's no joins involved.
I don't think the planner is that smart to account for correlation between values in different columns. When different values are used in filter (num=2, num=39, num=74), the query actually runs faster, while still being about twice slower than using an index scan. But the cost does not change much. It jumps up and down for different values, but it's still close to the initial value.

Aggregate  (cost=24239.02..24239.03 rows=1 width=8) (actual time=105.239..105.239 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=3011
  ->  Bitmap Heap Scan on public.aaa  (cost=12812.05..24214.48 rows=9816 width=0) (actual time=105.236..105.236 rows=0 loops=1)
        Output: num, flag
        Recheck Cond: (aaa.num = 39)
        Filter: aaa.flag
        Buffers: shared hit=3011
        ->  BitmapAnd  (cost=12812.05..12812.05 rows=9816 width=0) (actual time=105.157..105.157 rows=0 loops=1)
              Buffers: shared hit=3011
              ->  Bitmap Index Scan on i1  (cost=0.00..1134.94 rows=97667 width=0) (actual time=15.725..15.725 rows=100000 loops=1)
                    Index Cond: (aaa.num = 39)
                    Buffers: shared hit=276
              ->  Bitmap Index Scan on i2  (cost=0.00..11671.96 rows=1005003 width=0) (actual time=77.920..77.920 rows=1000000 loops=1)
                    Index Cond: (aaa.flag = true)
                    Buffers: shared hit=2735
Planning time: 0.104 ms
Execution time: 105.553 ms

Aggregate  (cost=65785.99..65786.00 rows=1 width=8) (actual time=48.587..48.587 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44524
  ->  Index Scan using i1 on public.aaa  (cost=0.44..65761.45 rows=9816 width=0) (actual time=48.583..48.583 rows=0 loops=1)
        Output: num, flag
        Index Cond: (aaa.num = 39)
        Filter: aaa.flag
        Rows Removed by Filter: 100000
        Buffers: shared hit=44524
Planning time: 0.097 ms
Execution time: 48.620 ms

The reason why it's more than a bit slower is due to the "density" [0] of the
heap pages read.  num=1 is more selective than flag=true, so it scans i1,
reading 1% of the whole table.  But it's not reading the first 1% or
some other 1% of the table, it reads tuples evenly distributed across the
entire table (226*0.01 = ~2 rows of each page).  Since the index was created
after the INSERT, the repeated keys (logical value: id%100) are read in
physical order on the heap, so this is basically doing a seq scan, but with the
additional overhead of reading the index, and maybe doing an fseek() before
each/some read()s, too.  You could confirm that by connecting strace to the
backend before starting the query.

Since you did that using % and with indices added after the INSERT, you can't
improve it by reindexing (as I was able to for my case).  That's an elegant
test case, so thanks.

I think shared_buffers=512MB is just small enough for this test to be painful
for 1e7 rows.  I see the table+index is 559MB.
           table           | ~count  |    size    |   toast |  idx   | size + toast + idx
 aaa                       | 9999994 | 346 MB     | 0 bytes    | 428 MB | 774 MB

But the plan says all buffers are "shared hit", and none "read", so that's probably not an issue.

I don't know if that's really similar to your production use case, but I would
recommend trying BRIN indices, which always require a bitmap scan.  Note that
some things (like max()) that can use an btree index cannot use brin.  PG10.1
has WITH autosummarize, which was important for our use, since we rarely do
UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).
Yes, BRIN indexes should be beneficial for our case, because there is a date column which grows over time (not strictly regularly, but still). Unfortunately, we're still migrating our databases from 9.3 to 9.6. Anyway, thanks for the advice.


[0] I'm borrowing Jeff's language from here:
"density" wasn't our problem, but it's a perfect description of this issue.

Reply via email to