On Tue, Dec 12, 2017 at 01:29:48AM -0800, Jeff Janes wrote:
> On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby <pry...@telsasoft.com> wrote:
> > On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:

> > > In any case, given that we do this calculation without regard
> > > to any specific index,
> >
> > One solution is to compute stats (at least correlation) for all indices,
> > not
> > just expr inds.  I did that earlier this year while throwing around/out
> > ideas.
> > https://www.postgresql.org/message-id/20170707234119.
> > GN17566%40telsasoft.com
> When is the correlation of a column which is not the leading column of a
> btree index or in a brin index ever used?  If we did compute index-specific
> correlations, we could maybe just drop pure-column correlations.

Yes I think so - correlation is collected for every column, but only used for

I also have a comment to myself in that patch to force attstattarget=0 for
non-expr indices, to avoid keeping MCV/histogram which duplicates that of their

> Trying to keep it all in my own head: For sufficiently large number of
> > pages,
> > bitmap scan should be preferred to idx scan due to reduced random-page-cost
> > outweighing its overhead in CPU cost.
> But CPU cost is probably not why it is losing anyway.
> Index scans get a double bonus from high correlation.  It assumes that only
> a small fraction of the table will be visited.  And then it assumes that
> the visits it does make will be largely sequential.  I think that you are
> saying that for a large enough table, that last assumption is wrong, that
> the residual amount of non-correlation is enough to make the table reads
> more random than sequential.  Maybe.  Do you have a test case that
> demonstrates this?  If so, how big do we need to go, and can you see the
> problem on SSD as well as HDD?

Right: The "residual"/fine-scale variations (those which are not adequately
represented by correlation metric) are/may be non-sequential, so don't get good

The original issue was with an 75GB table (an inheritence child) and an
analytic query previously taking ~30min at that point taking 4-5 hours due to
random seeks (from duplicate values in a timestamp column with 1second
resolution).  There would've been very little if any of the previous day's
table cached: the child table being queried (by way of its parent) had size
roughly same as the server's RAM, and would've been loaded over the course of
the preceding 6-30hours, and not frequently accessed.  It may be that there's a
sharp change once cache no longer effectively mitigates the random heap reads.

SSD: good question.

Here's an rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which
is being used as OS page cache), and 32GB SSD (with random_page_cost=1).  The
server is in use by our application.

I believe you could scale up the size of the table to see this behavior with
any cache size.  0.0001 controls the "jitter", with smaller values being more

postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); 
INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM 
generate_series(1,99999999) a; VACUUM ANALYZE t;
 public | t    | table | pryzbyj | 3458 MB |
relpages | 442478

For comparison purposes/baseline; here's a scan on an SEPARATE index freshly
built AFTER insertions:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 0 
AND 4000;
First invocation:
#1  ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 
width=4) (actual time=25.660..52575.127 rows=39996029 loops=1)
         Buffers: shared hit=1578644 read=286489 written=1084
Subsequent invocations with (extra) effect from OS cache:
#2  ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 
width=4) (actual time=61.054..37646.556 rows=39996029 loops=1)
         Buffers: shared hit=1578644 read=286489 written=2223
#3  ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 
width=4) (actual time=9.344..31265.398 rows=39996029 loops=1)
         Buffers: shared hit=1578644 read=286489 written=1192

Dropping that index, and scanning a different range on the non-fresh index:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 4000 
AND 8000;
#1  ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 
width=4) (actual time=95.815..139152.147 rows=40009853 loops=1)
         Buffers: shared hit=1948069 read=316536 written=3411
Rerunning with cache effects:
#2  ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 
width=4) (actual time=203.590..87547.287 rows=40009853 loops=1)
         Buffers: shared hit=1948069 read=316536 written=5712
#3  ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 
width=4) (actual time=164.504..83768.890 rows=40009853 loops=1)
         Buffers: shared hit=1948069 read=316536 written=1979

Compare to seq scan:
   ->  Seq Scan on t  (cost=0.00..1942478.00 rows=40298277 width=4) (actual 
time=1173.162..20980.069 rows=40009853 loops=1)
         Buffers: shared hit=47341 read=395137

   ->  Bitmap Heap Scan on t  (cost=975197.91..2022150.06 rows=40298277 
width=4) (actual time=24396.270..39304.813 rows=40009853 loops=1)
         Buffers: shared read=316536 written=1431

The index scan reads 2.3e6 pages, compared to 4e5 pages (seq) and 3e5 pages
(bitmap).  And idx scans were 4-7x slower than seq scan.  Was the index scan
actually that badly affected by CPU cost of revisiting pages (rather than IO
costs)?  Or did the OS actually fail to cache the 3e5 pages "read"?  That would
be consistent with running almost 2x faster on the 3rd invocation.

The "hits" are largely from pages being revisited and recently accessed.  Are
the misses (reads) mostly from pages being revisited after already falling out
of cache ?  Or mostly initial access ?  Or ??

If the index scan is really paying an high IO cost for rereads and not
primarily a CPU cost, this seems to be something like the "correlated index
scan" variant of traditional failure to effectively cache doing seq scan on a
sufficiently large table using a MRU buffer - the cache is ineffective for
adequately mitigating IO costs when the (re)reads have sufficient "spread".

postgres=# SELECT tablename, attname, correlation FROM pg_stats WHERE 
tablename   | t
attname     | i
correlation | 1

I still want to say that's unreasonble due to (I think) high fraction of
nonrandom reads and associated absense of readahead.

> I think it would be easier to give bitmap scans their due rather than try
> to knock down index scans.  But of course a synthetic test case would go a
> long way to either one.

As Tom said: index scans involving repeated keys are assuming best-case
sequential reads for a given computed correlation.  I'd be happy if they were
costed to avoid that assumption, and instead used some "middle of the road"
interpretation (probably based on correlation and MCV fraction?), but, in the
alternate, need to distinguish the cost_index cases, rather than adjusting
bitmap.  This is what led me to play around with stats computation for all


Reply via email to