On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> We recently had an issue in production, where a bitmap scan was chosen
> instead of an index scan. Despite being 30x slower, the bitmap scan had
> about the same cost as the index scan.

Me too, see also:

> drop table if exists aaa;
> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> generate_series(1, 10000000) id;
> create index i1 on aaa  (num);
> create index i2 on aaa  (flag);
> analyze aaa;
> select relname, reltuples::bigint, relpages::bigint,
> (reltuples/relpages)::bigint tpp from pg_class where relname
> in('aaa','i1','i2') order by relname;
> "aaa";9999985;44248;226
> "i1";9999985;27422;365
> "i2";9999985;27422;365
> The query was:
> explain (analyze,verbose,costs,buffers)
> select count(*) from aaa where num = 1 and flag = true;

Note that id%100==1 implies flag='t', so the planner anticipates retrieving
fewer rows than it will ultimately read, probably by 2x.  It makes sense that
causes the index scan to be more expensive than expected, but that's only
somewhat important, since there's no joins involved.

The reason why it's more than a bit slower is due to the "density" [0] of the
heap pages read.  num=1 is more selective than flag=true, so it scans i1,
reading 1% of the whole table.  But it's not reading the first 1% or 
some other 1% of the table, it reads tuples evenly distributed across the
entire table (226*0.01 = ~2 rows of each page).  Since the index was created
after the INSERT, the repeated keys (logical value: id%100) are read in
physical order on the heap, so this is basically doing a seq scan, but with the
additional overhead of reading the index, and maybe doing an fseek() before
each/some read()s, too.  You could confirm that by connecting strace to the
backend before starting the query.

Since you did that using % and with indices added after the INSERT, you can't
improve it by reindexing (as I was able to for my case).  That's an elegant
test case, so thanks.

I think shared_buffers=512MB is just small enough for this test to be painful
for 1e7 rows.  I see the table+index is 559MB.

I don't know if that's really similar to your production use case, but I would
recommend trying BRIN indices, which always require a bitmap scan.  Note that
some things (like max()) that can use an btree index cannot use brin.  PG10.1
has WITH autosummarize, which was important for our use, since we rarely do
UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).


[0] I'm borrowing Jeff's language from here:
"density" wasn't our problem, but it's a perfect description of this issue.

Reply via email to