Months ago I reported an issue with very slow index scans on tables with high
"correlation" of their indexed column, due (we concluded at the time) to
duplicate/repeated values of that column causing many lseek()s. References:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#[email protected]
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#[email protected]
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#[email protected]
My interim workarounds were a reindex job and daily-granularity partitions
(despite the excessive number of child tables) to encourage seq scans rather
than index scans for the daily analysis jobs. I've now cobbled together a
patch to throw around and see if we can improve on that. I tried several
changes, hoping to discourage index scans in these cases. The logic that seems
to most accurately reflect costs has two changes:
Postgres' existing behavior stores table correlation (heap value vs. heap
position) but doesn't look at index correlation, so it can't distinguish
between a just-built index and a highly fragmented one, or one with highly
non-sequential TIDs. My patch causes ANALYZE to do a TID scan (sampled across
the MCVs) to determine the correlation of heap TID vs. index tuple logical
location (as opposed to the table correlation, computed as heap TID vs. heap
value).
The second change averages separate correlation values computed over small
batches of 300 consecutive TIDs, rather than using the coarse-grained,
aggregate correlation of a small sample of pages across the entire index.
Postgres' existing sampling is designed to give an even sample across all
rows. An issue with this coarse-grained correlation is that the index can
have a broad correlation (to physical heap location) but many small-scale
deviations, which don't show up when sampling a relatively small fraction of
a large table; and/or the effect of the deviations is insignificant/noise and
the correlation is still computed near 1.
I believe the "large scale" correlation that postgres computes from the block
sample fails to represent small-scale uncorrelated reads, which contribute a
large number of random reads that aren't included in the planner's cost.
Not only are the index reads highly random (which the planner already
assumes), but the CTIDs referenced within a given btree leaf page are also
substantially non-sequential. It seems to affect INSERTs which, over a short
interval of time, have a small-to-moderate range of column values, but which
still have a strong overall trend in that column WRT time (and heap location).
Think of inserting a "normal" distribution of timestamps centered around
now().
My original report shows an lseek() for nearly every read on the *heap*. The
original problem was on a table of telecommunications CDRs, indexed by "record
opening time" (start time) and with a high correlation value in pg_stats. We
import data records from a file, which is probably more or less in order of
"end time".
That still yields a broad correlation on "start time", but individual TIDs are
nowhere near sequential. Two phone calls which end in the same one-second
interval are not unlikely to have started many minutes apart... but two calls
which end within the same second are very likely to have started within an
hour of each other, since the typical duration is <1h. But insert volume is
high, and there are substantial repeated keys, so the portion of an index scan
returning CTIDs for some given key value (a timestamp with second resolution
in our case) ends up reading heap tuples from a non-negligible fraction of the
table: maybe only 1%, but that's still 300 seeks across 1% of a table which is
tens of GB, and it takes long enough that caching is inadequate to
substantially mitigate the cost.
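To make that pattern reproducible without the CDR data, here's a minimal
sketch (the table and column names here are made up for illustration): rows
are inserted in "end time" order, and the indexed "start time" is the end time
minus a random duration under an hour, so the indexed column trends strongly
with heap position while nearby values' TIDs are scattered:
| CREATE TABLE calls(start_time timestamptz, end_time timestamptz);
| CREATE INDEX ON calls(start_time);
| INSERT INTO calls
|   SELECT g.end_time - random() * interval '1 hour', g.end_time
|   FROM generate_series(now() - interval '1 day', now(), interval '1 second') AS g(end_time);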
It's not clear to me that there's a good way to evenly *sample* a fraction of
the index blocks in a manner which is agnostic to different AMs. Scanning the
entirety of all indices on a relation during (auto)analyze may be too
expensive, so I implemented an index scan over the MCV list. I'm guessing this
might cause the correlation to be under-estimated, and prefer bitmap scans
somewhat more than justified, due to the btree insertion logic for repeated
keys (used to avoid O(n^2) insertion behavior). The MCV list isn't perfect
either, since the problem can also occur without repeated keys, e.g. with
normally distributed floating point values (or timestamps with fractional
seconds).
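As a rough way to see what MCV-driven sampling is up against, here's a sketch
(against the demo table t created below, after ANALYZE) which pulls the MCVs
out of pg_stats and counts how many distinct heap pages each value's tuples
span, i.e. roughly how many heap pages an index scan on that single value
would have to visit:
| SELECT v, count(*) AS tuples,
|        count(DISTINCT (t.ctid::text::point)[0]) AS heap_pages
| FROM t
| JOIN (SELECT unnest(most_common_vals::text::int[]) AS v
|       FROM pg_stats WHERE tablename = 't' AND attname = 'i') mcv ON t.i = mcv.v
| GROUP BY v ORDER BY heap_pages DESC;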
I ran pageinspect on a recent child of the table that triggered the original
report:
ts=# SELECT itemoffset, ctid FROM
bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT
22 OFFSET 1;
itemoffset | ctid
------------+--------
2 | (81,4)
3 | (5,6)
4 | (6,5)
5 | (9,1)
6 | (10,1)
7 | (14,1)
8 | (21,8)
9 | (25,1)
10 | (30,5)
11 | (41,1)
12 | (48,7)
13 | (49,3)
14 | (50,2)
15 | (51,1)
16 | (54,1)
17 | (57,7)
18 | (58,6)
19 | (59,6)
20 | (63,5)
21 | (71,6)
22 | (72,5)
23 | (78,7)
(22 rows)
=> 22 adjacent index tuples referencing 22 different heap pages, only 6 of
which are sequential => 16 lseek()s, i.e. random page cost.
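The same tally can be made mechanically; here's a sketch over the same
bt_page_items() output, counting transitions where the referenced heap block
is neither the same as nor adjacent to the previous entry's (the exact count
depends on how you treat the first entry and adjacency):
| SELECT count(*) AS random_jumps FROM (
|   SELECT (ctid::text::point)[0] AS blk,
|          lag((ctid::text::point)[0]) OVER (ORDER BY itemoffset) AS prev
|   FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx', 6)
|   WHERE itemoffset > 1
| ) s
| WHERE prev IS NOT NULL AND blk <> prev AND blk <> prev + 1;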
To generate data demonstrating this problem you can do things like:
| CREATE TABLE t(i int, j int);
| CREATE INDEX ON t(i);
\! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done
or:
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a;
or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;
Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM
generate_series(1,999999) i ORDER BY i
| ANALYZE t;
-- note: the correlation is even higher for larger tables with the same number
-- of repeated keys, which is bad since the cost should go up linearly with the
-- size and the associated number of lseek()s. That's one component of the
-- problem, and I think a necessary component of any solution.
postgres=# SELECT tablename, attname, correlation,
array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
tablename | attname | correlation | array_length
-----------+---------+-------------+--------------
t | i | 0.995212 | 87
t | j | |
t_i_idx | i | -0.892874 | 87
^^^^^^^
... this last line is added by my patch.
That's a bad combination: the high table correlation means the planner thinks
only a fraction of the heap will be read, and read sequentially, so it isn't
discouraged from doing an index scan. But the index TIDs are much less well
correlated (0.89 vs 0.99).
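For a rough sense of how the whole-sample correlation hides the small-scale
scatter, here's a sketch on the demo table; unlike the patch it reads the
whole table, and it relies on ORDER BY i approximating index order (ties may
come back in a different order than the index's). It compares the single
correlation of heap block vs. position-in-order against the average of
correlations computed over chunks of 300 consecutive entries:
| WITH ordered AS (
|   SELECT (row_number() OVER (ORDER BY i))::int - 1 AS pos,
|          (ctid::text::point)[0] AS heap_blk
|   FROM t
| ), chunks AS (
|   SELECT corr(pos, heap_blk) AS chunk_corr FROM ordered GROUP BY pos / 300
| )
| SELECT (SELECT corr(pos, heap_blk) FROM ordered) AS whole_sample_corr,
|        (SELECT avg(chunk_corr) FROM chunks)      AS avg_chunk_corr;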
Note: the negative correlation at the tuple level seems to be related to this
comment:
src/backend/access/nbtree/nbtinsert.c- * Once we have chosen the
page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: * any existing equal keys
because of the way _bt_binsrch() works.
Note also:
|postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
|leaf_fragmentation | 67.63
.. but keep in mind that leaf_fragmentation only checks leaf page order, not
CTID order within index pages. The planner already assumes index page reads
are random. Maybe it's a useful indicator (?), but it doesn't lend itself to
an accurate cost estimate.
For testing purposes, with:
| shared_buffers | 128kB
| public | t | table | pryzbyj | 35 MB |
| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms
% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr
|head
39474 lseek(41,
359 lseek(44,
8 lseek(18,
=> 40k seeks on a 35MB table, that's 10 seeks per heap page!
open("base/12411/16634", O_RDWR) = 41
open("base/12411/16636", O_RDWR) = 44
postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
16634 | t
16636 | t_i_idx
2017-07-07 17:45:54.075 CDT [6360] WARNING: HERE 1222: csquared=0.797225
minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702
With my patch, the index scan is estimated to take ~4400 seeks, rather than
the actual 40k, probably because max_IO_cost assumes that a given heap page
will be visited only once. But in the worst case each of its tuples would
require a separate lseek()... I'm not suggesting to change that, since making
max_IO 100x higher would probably change query plans more dramatically than
desired. But also note that, unpatched, with table correlation >0.99, postgres
would've under-estimated the I/O cost not by a factor of 10x but by 400x.
| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation,
array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
| tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
| t | i | 0.995235 | 67
| t_i_idx | i | 1 | 67
=> Correctly distinguishes freshly reindexed table.
% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
99 lseek(37,
2017-07-07 17:49:47.745 CDT [6428] WARNING: HERE 1222: csquared=1.000000
minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702
=> csquared=1, so the min_IO cost is used instead of something close to max_IO.
(This patch doesn't change the computation of those values at all, just the
correlation, whose square is used to interpolate between the correlated/min
cost and uncorrelated/max cost: run_cost += max_IO_cost + csquared *
(min_IO_cost - max_IO_cost).)
The correlated estimate of 108 seeks vs. 99 actual is essentially what
unpatched postgres would've computed anyway, by using the table correlation of
0.9952 and implicitly assuming the index to be similarly correlated.
I hope that demonstrates the need to distinguish index correlation from table
correlation. I'm hoping for comments on the existing patch, specifically
whether there's a better way to sample the index than "MCVs or partial index
scan". I've left some fragments in place from an earlier implementation
involving btree page-level fragmentation (like pgstatindex). Also: does it
make sense to keep an MCV list/histogram for a non-expression index, which
duplicates the table column's? The stats lookup in selfuncs.c
btcostestimate() would then have to take the correlation from the index, and
the rest of the stats from its table.
A bitmap layer adds overhead, which should be avoided if not needed. But it
shouldn't be a huge impact, and I think its relative effect is only high when
returning a small number of rows. I'm thinking of a few cases:
- unique / low-cardinality index scans, or scans without duplicate keys: these
should have low index_pages_fetched(), so max_IO_cost should not be very
different from min_IO_cost, and the new index-based correlation shouldn't have
an effect much different from the table correlation;
- unclustered/uncorrelated tables: tables whose heap has low correlation are
already discouraged from index scans; this includes tables whose column is
UPDATEd and not just INSERTed;
- tables with correlated heap AND index: csquared should still be ~0.99 and not
change much;
- correlated table with uncorrelated index: this is the target case with the
intended behavior change.
My apologies: I think that's a bit of a brain dump.
Justin
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b63827..1acf118 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -98,6 +98,11 @@ static int compare_rows(const void *a, const void *b);
static int acquire_inherited_sample_rows(Relation onerel, int elevel,
HeapTuple *rows, int
targrows,
double *totalrows,
double *totaldeadrows);
+static int acquire_sample_index(Relation onerel, Relation index, int elevel,
+ HeapTuple *rows, int targrows,
+ double *totalrows, double
*totaldeadrows);
+static double correlation(int values_cnt, double corr_xysum);
+static double idx_correlation (Oid indexoid, VacAttrStatsP stats, int
targrows, int num_mcvs, Datum *mcv_values);
static void update_attstats(Oid relid, bool inh,
int natts, VacAttrStats **vacattrstats);
static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -206,6 +211,7 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
* used to do this in get_rel_oids() but seems safer to check after
we've
* locked the relation.
*/
+ // TODO: if (onerel->rd_rel->relkind == RELKIND_INDEX) => needs a
separate sample_rows function?
if (onerel->rd_rel->relkind == RELKIND_RELATION ||
onerel->rd_rel->relkind == RELKIND_MATVIEW)
{
@@ -433,37 +439,41 @@ do_analyze_rel(Relation onerel, int options, VacuumParams
*params,
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo;
+ ListCell *indexpr_item;
thisdata->indexInfo = indexInfo =
BuildIndexInfo(Irel[ind]);
thisdata->tupleFract = 1.0; /* fix later if partial */
- if (indexInfo->ii_Expressions != NIL && va_cols == NIL)
+ if (va_cols != NIL)
{
- ListCell *indexpr_item =
list_head(indexInfo->ii_Expressions);
+ continue;
+ }
- thisdata->vacattrstats = (VacAttrStats **)
+ indexpr_item = list_head(indexInfo->ii_Expressions);
+ thisdata->vacattrstats = (VacAttrStats **)
palloc(indexInfo->ii_NumIndexAttrs *
sizeof(VacAttrStats *));
- tcnt = 0;
- for (i = 0; i < indexInfo->ii_NumIndexAttrs;
i++)
- {
- int keycol =
indexInfo->ii_KeyAttrNumbers[i];
+ tcnt = 0;
+ for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+ {
+ int keycol =
indexInfo->ii_KeyAttrNumbers[i];
+ Node *indexkey=NULL;
- if (keycol == 0)
- {
- /* Found an index expression */
- Node *indexkey;
-
- if (indexpr_item == NULL)
/* shouldn't happen */
- elog(ERROR, "too few
entries in indexprs list");
- indexkey = (Node *)
lfirst(indexpr_item);
- indexpr_item =
lnext(indexpr_item);
- thisdata->vacattrstats[tcnt] =
-
examine_attribute(Irel[ind], i + 1, indexkey);
- if
(thisdata->vacattrstats[tcnt] != NULL)
- tcnt++;
- }
+ if (keycol == 0)
+ {
+ /* Found an index expression */
+ if (indexpr_item == NULL) /*
shouldn't happen */
+ elog(ERROR, "too few entries in
indexprs list");
+ indexkey = (Node *)
lfirst(indexpr_item);
+ indexpr_item = lnext(indexpr_item);
}
- thisdata->attr_cnt = tcnt;
+
+ thisdata->vacattrstats[tcnt] =
+ examine_attribute(Irel[ind], i + 1,
indexkey);
+
+ if (thisdata->vacattrstats[tcnt] != NULL)
+ tcnt++;
}
+
+ thisdata->attr_cnt = tcnt;
}
}
@@ -548,11 +558,38 @@ do_analyze_rel(Relation onerel, int options, VacuumParams
*params,
MemoryContextResetAndDeleteChildren(col_context);
}
+# if 0
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ int
indnumrows;
+ double indtotalrows,
+
indtotaldeadrows;
+ HeapTuple *indrows= palloc(targrows *
sizeof(HeapTuple));
+               /*
+                * Attempted to do an index scan for each index, in order
+                * to get correlation stats of the HEAP; I was thinking
+                * (probably incorrectly) that the high random cost was
+                * due to index seeks, and that it was needed to
+                * traverse the index in logical rather than physical
+                * order to determine its correlation.  But the planner
+                * already assumes that index reads are random.
+                */
+
+ indnumrows = acquire_sample_index (onerel, Irel[ind],
elevel,
+ indrows, targrows, &indtotalrows,
&indtotaldeadrows);
+ compute_index_stats(Irel[ind], indnumrows,
+ indexdata+ind, 1, /* XXX: should update
c_i_s to do one index only? */
+ indrows, indnumrows,
+ col_context);
+
+ }
+# else
if (hasindex)
- compute_index_stats(onerel, totalrows,
- indexdata,
nindexes,
- rows, numrows,
- col_context);
+ compute_index_stats(onerel, numrows,
+ indexdata, nindexes,
+ rows, numrows,
+ col_context);
+# endif
MemoryContextSwitchTo(old_context);
MemoryContextDelete(col_context);
@@ -721,10 +758,6 @@ compute_index_stats(Relation onerel, double totalrows,
rowno;
double totalindexrows;
- /* Ignore index if no columns to analyze and not partial */
- if (attr_cnt == 0 && indexInfo->ii_Predicate == NIL)
- continue;
-
/*
* Need an EState for evaluation of index expressions and
* partial-index predicates. Create it in the per-index
context to be
@@ -767,6 +800,7 @@ compute_index_stats(Relation onerel, double totalrows,
if (!ExecQual(predicate, econtext))
continue;
}
+
numindexrows++;
if (attr_cnt > 0)
@@ -790,6 +824,7 @@ compute_index_stats(Relation onerel, double totalrows,
VacAttrStats *stats =
thisdata->vacattrstats[i];
int attnum =
stats->attr->attnum;
+ stats->rows=rows;
if (isnull[attnum - 1])
{
exprvals[tcnt] = (Datum) 0;
@@ -815,7 +850,7 @@ compute_index_stats(Relation onerel, double totalrows,
totalindexrows = ceil(thisdata->tupleFract * totalrows);
/*
- * Now we can compute the statistics for the expression columns.
+ * Now we can compute the statistics.
*/
if (numindexrows > 0)
{
@@ -830,11 +865,16 @@ compute_index_stats(Relation onerel, double totalrows,
stats->exprvals = exprvals + i;
stats->exprnulls = exprnulls + i;
stats->rowstride = attr_cnt;
+// Low stats target for non-expr indices to avoid having a duplicate MCV
+// and/or hist?
+// avoid keeping an (duplicate) MCV list or
+// histogram for non-expression columns?
+// if (indexInfo->ii_KeyAttrNumbers[i]!=0) // && stats->attr->attstattarget!=0)
+// stats->attr->attstattarget=0;
+
(*stats->compute_stats) (stats,
ind_fetch_func,
numindexrows,
totalindexrows);
-
/*
* If the n_distinct option is specified, it
overrides the
* above computation. For indices, we always
use just
@@ -896,7 +936,7 @@ examine_attribute(Relation onerel, int attnum, Node
*index_expr)
/*
* When analyzing an expression index, believe the expression tree's
type
* not the column datatype --- the latter might be the opckeytype
storage
- * type of the opclass, which is not interesting for our purposes.
(Note:
+ * type of the opclass, which is not interesting for our purposes. XXX
(Note:
* if we did anything with non-expression index columns, we'd need to
* figure out where to get the correct type info from, but for now
that's
* not a problem.) It's not clear whether anyone will care about
the
@@ -1478,6 +1518,305 @@ acquire_inherited_sample_rows(Relation onerel, int
elevel,
}
+// this doesn't work, since broad/coarse correlation doesn't well represent
+// non-sequential heap page access which may happen at a small scale..
+static int
+acquire_sample_index(Relation onerel, Relation index, int elevel,
+ HeapTuple *rows, int targrows,
+ double *totalrows, double
*totaldeadrows)
+{
+ int numrows = 0; /* # rows now in reservoir */
+ double samplerows = 0; /* total # rows collected */
+ double rowstoskip = -1; /* -1 means not set yet */
+
+ HeapTuple targtuple;
+ // Snapshot snapshot =
RegisterSnapshot(GetTransactionSnapshot());
+ // Snapshot snapshot = GetLatestSnapshot();
+ Snapshot snapshot = GetActiveSnapshot();
+ // Snapshot snapshot = GetOldestSnapshot();
+
+ /* Prepare for sampling rows */
+ ReservoirStateData rstate;
+ IndexScanDesc index_scan;
+ reservoir_init_selection_state(&rstate, targrows);
+
+ // ScanKeyData scankeys[1];
+ // ScanKeyEntryInitialize(scankeys, SK_ISNULL|SK_SEARCHNOTNULL, 1,
InvalidStrategy, InvalidOid, InvalidOid, InvalidOid, (Datum) 0);
+ // XXX search not null ??
+ // SK_SEARCHNOTNULL
+ index_scan= index_beginscan(onerel, index, snapshot, 0, 0);
+ // index_rescan(index_scan, scankeys, 0, NULL, 0);
+ index_rescan(index_scan, NULL, 0, NULL, 0);
+
+ // while ((targtuple = index_getnext(index_scan, ForwardScanDirection))
!= NULL)
+ // XXX: it's pretty unfortunate to do a full index scan for each table
+ // being analyzed ... esp. since a design goal is to avoid very
+ // inefficient random read patterns during index scans... consider
+ // alternatives like index scanning for each MCV ? However I suspect
+ // that would tend to strongly underestimate correlation in some cases
+ // and too strongly discourage index scans..
+ // Probably go back to a block sample of leaf pages, counting the
+ // linearity of its referenced heap blocks within an index page, but
+ // not from one index page to another???
+
+ // for (unsigned long int c=0; (targtuple = index_getnext(index_scan,
BackwardScanDirection)) != NULL ; ++c) // Is there a more useful value to use
here ???
+ for (unsigned long int c=0; (targtuple = index_getnext(index_scan,
ForwardScanDirection)) != NULL ; ++c) // Is there a more useful value to use
here ???
+ // ItemPointer tid;
+ // while ((tid = index_getnext_tid(index_scan, ForwardScanDirection))
!= NULL)
+ {
+ HeapTuple h = heap_copytuple(targtuple);
+
+ // ItemPointerSet(&(h->t_self), c>>16, c&0xffff );
+ // index_fetch_heap(index_scan);
+
+ ItemPointerSet(&(h->t_self), c>>16, c&0xffff ); // no good, the
correlation of heap value WRT index location is 100% ...
+ // h->t_self.ip_blkid.bi_lo=c&0xffff; // >>8; //
ItemPointerSet(&(h->t_self), c>>16, c&0xffff );
+
+ /* Vitter's algorithm, see above */
+ // ItemPointerSetBlockNumber(&h->t_self, c);
+ // ItemPointerSetOffsetNumber(&h->t_self, c);
+ // h->t_self.ip_blkid.bi_hi=c>>16;
+ // h->t_self.ip_blkid.bi_lo=c&OffsetNumberMask;
+ // h->t_self.ip_blkid.bi_hi+=c/30;
+
+ if (numrows < targrows) {
+ rows[numrows++] = h;
+ samplerows += 1;
+ continue;
+ }
+
+ if (rowstoskip < 0)
+ rowstoskip = reservoir_get_next_S(&rstate, samplerows,
targrows);
+
+ if (rowstoskip <= 0)
+ {
+ int k = (int) (targrows *
sampler_random_fract(rstate.randstate));
+
+ Assert(k >= 0 && k < targrows);
+ heap_freetuple(rows[k]);
+ rows[k] = h;
+ }
+
+ rowstoskip -= 1;
+ samplerows += 1;
+ }
+
+ index_endscan(index_scan);
+
+ if (numrows == targrows)
+ qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+ *totalrows = samplerows;
+ // totalblocks = RelationGetNumberOfBlocks(onerel);
+ // *totalrows = vac_estimate_reltuples(onerel, true, totalblocks, bs.m,
liverows);
+ // if (bs.m > 0)
+ // *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+ // else
+ // *totaldeadrows = 0.0;
+
+ ereport(elevel,
+ (errmsg("\"%s\": scanned %d of %u pages, "
+ "containing %.0f live rows and %.0f
dead rows; "
+ "%d rows in sample, %.0f estimated
total rows",
+ RelationGetRelationName(index),
+ -1, -1,
+ -1.0, -1.0,
+ numrows, *totalrows)));
+
+
+ return numrows ;
+}
+
+/*
+ * Return the correlation coefficient given sum(x*y), where x gives the ranks
+ * of the values ordered by one thing, and y gives the ranks ordered by
+ * another (eg. heap value vs. heap location).
+ */
+static double
+correlation(int values_cnt, double corr_xysum)
+{
+ /*----------
+ * Since we know the x and y value sets are both
+ * 0, 1, ..., values_cnt-1
+ * we have sum(x) = sum(y) =
+ * (values_cnt-1)*values_cnt / 2
+ * and sum(x^2) = sum(y^2) =
+ * (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
+ *----------
+ * ... and the correlation coefficient reduces to */
+ double corr_xsum = ((double) (values_cnt - 1)) *
+ ((double) values_cnt) / 2.0;
+ double corr_x2sum = ((double) (values_cnt - 1)) *
+ ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
+
+ return (values_cnt * fabs(corr_xysum) - corr_xsum * corr_xsum) /
+ (values_cnt * corr_x2sum - corr_xsum * corr_xsum);
+}
+
+/*
+ * Compute avg of separately-computed correlation values, rather than
+ * correlation of single sample across entire table, which overestimates
+ * correlation for large tables. targrows is the number of pages to sample in
+ * each batch.
+ */
+static double
+idx_correlation (Oid indexoid, VacAttrStatsP stats, int targrows, int
num_mcvs, Datum *mcv_values)
+{
+ HeapTuple *rows= stats->rows; // reusing this allocation
+ double corr_sum=0;
+ int corr_samples=0;
+ int numrows = 0;
+ double deadrows = 0;
+ double liverows = 0;
+ long int batch;
+ double samplerows = 0;
+
+ Snapshot snapshot = GetActiveSnapshot();
+ Oid heapoid = IndexGetRelation(indexoid, false);
+ Relation heap = relation_open(heapoid, AccessShareLock);
+ Relation index= RelationIdGetRelation(indexoid);
+ IndexScanDesc index_scan = index_beginscan(heap, index, snapshot, 1,
0);
+
+ // int strategy=BTGreaterEqualStrategyNumber;
+ // Oid opfam=get_opfamily_member(opfamily, stats->attrtypid,
stats->attrtypid, BTGreaterEqualStrategyNumber);
+
+ /* For sampling: read the first TARGROWS TIDs from each value in the
MCV list */
+ // XXX: consider only the first handful of MCVs ?
+ // XXX: poor correlation in index can prolly happen without MCVs, eg.
normally-distributed float values without repeated keys */
+ // XXX: .. should we just read the first targrows TIDs returned by the
index or is there a better way ??
+
+ for (batch=0; batch<num_mcvs || (!num_mcvs && !batch); ) { // XXX:
protect against empty index ?
+ ScanKeyData scankeys;
+ double corr_xysum=0;
+ ItemPointer tid;
+
+ if (num_mcvs) {
+ // XXX: GreaterEqual ?
+ ScanKeyInit(&scankeys, stats->attr->attnum,
BTEqualStrategyNumber, get_opcode( ((StdAnalyzeData
*)stats->extra_data)->eqopr), mcv_values[batch]);
+ } else {
+ /* No MCVs: single iteration over first targrows tuples
returned by index */
+ // XXX SK_SEARCHNOTNULL
+ ScanKeyEntryInitialize(&scankeys,
SK_ISNULL|SK_SEARCHNOTNULL, stats->attr->attnum, InvalidStrategy, InvalidOid,
InvalidOid, InvalidOid, (Datum)0);
+ }
+
+ index_rescan(index_scan, &scankeys, 1, NULL, 0);
+ for ( ; numrows<targrows &&
NULL!=(tid=index_getnext_tid(index_scan, ForwardScanDirection)); samplerows++) {
+ if (!ItemPointerIsValid(tid)) {
+ // Shouldn't happen ?
+ deadrows++;
+ continue;
+ }
+
+ // elog(WARNING, "%s %s %d TID: %d,%d", __FILE__,
__FUNCTION__, __LINE__, ItemPointerGetBlockNumber(tid),
ItemPointerGetOffsetNumber(tid) );
+ rows[numrows]->t_self = *tid;
+ rows[numrows]->t_len=numrows; // abusing this field;
+ numrows++;
+ liverows++;
+ }
+
+ // avoid NaN if many dead tuples? if (!numrows) continue;
+
+ /* Retrieved consecutive TIDs, now compute their (fine-level)
correlation */
+ qsort((void *) rows, numrows, sizeof(*rows), compare_rows);
+ for (int j=0; j<numrows; ++j)
+ corr_xysum+=j*rows[j]->t_len;
+
+ corr_samples++;
+ corr_sum+=correlation(numrows, corr_xysum);
+
+ numrows=0;
+ ++batch;
+ if (tid==NULL)
+ break;
+ /* Ran out of index in fewer than targrows */
+ }
+
+ ereport(LOG,
+ (errmsg("\"%s(%s)\": scanned %ld batches with total
%.0f TIDs, "
+ "containing %.0f live and %.0f dead
TIDs; ",
+ RelationGetRelationName(index),
+ NameStr(stats->attr->attname),
+ batch, samplerows,
+ liverows, deadrows)));
+
+ index_endscan(index_scan);
+ relation_close(index, NoLock);
+ relation_close(heap, NoLock);
+ return corr_sum/corr_samples;
+}
+
+#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
+#define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+
+/*
+ * Do a Vitter block walk along entire index, in physical order, to determine
+ * fraction of LEAF nodes which have "next" pointers with a higher block number
+ * than themselves.. For a highly-correlated table, this will be ~1, for a
+ * poorly correlated table(???), or one with many repeated keys, this will be
+ * between 0 and ~0.5, and index scan across those duplicate keys will have a
+ * high random component.
+ * Logic bits stolen from pgstatindex.
+ */
+
+#include "access/nbtree.h"
+#include "catalog/pg_am.h"
+
+static double
+idx_corr_fudge(Relation index, int targrows)
+{
+ BlockNumber totalblocks;
+ BlockSamplerData bs;
+
+ double leaf_pages = 0;
+ double fragments = 0;
+
+ // TransactionId OldestXmin;
+ // OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+ // Snapshot snapshot =
RegisterSnapshot(GetTransactionSnapshot());
+ // Snapshot snapshot = GetLatestSnapshot();
+ // Snapshot snapshot = GetActiveSnapshot();
+ // Snapshot snapshot = GetOldestSnapshot();
+ if (!IS_BTREE(index)) {
+ relation_close(index, AccessShareLock);
+ return 1;
+ }
+
+ totalblocks = RelationGetNumberOfBlocks(index);
+ BlockSampler_Init(&bs, totalblocks-1, targrows, random());
+
+ while (BlockSampler_HasMore(&bs))
+ {
+ BlockNumber blkno = BlockSampler_Next(&bs);
+ Buffer buffer;
+ Page page;
+ BTPageOpaque opaque;
+
+ vacuum_delay_point();
+ CHECK_FOR_INTERRUPTS();
+
+ buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy); // bstrategy
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (P_ISDELETED(opaque) || P_IGNORE(opaque))
+ continue;
+ else if (P_ISLEAF(opaque))
+ {
+ leaf_pages++;
+ if (opaque->btpo_next != P_NONE && opaque->btpo_next <
blkno)
+ fragments++;
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ }
+
+ relation_close(index, AccessShareLock);
+ return fragments/leaf_pages;
+}
+
/*
* update_attstats() -- update attribute statistics for one relation
*
@@ -1658,6 +1997,7 @@ ind_fetch_func(VacAttrStatsP stats, int rownum, bool
*isNull)
/* exprvals and exprnulls are already offset for proper column */
i = rownum * stats->rowstride;
*isNull = stats->exprnulls[i];
+
return stats->exprvals[i];
}
@@ -2258,7 +2598,7 @@ compute_scalar_stats(VacAttrStatsP stats,
stats->attrtype->typlen == -1);
bool is_varwidth = (!stats->attrtype->typbyval &&
stats->attrtype->typlen < 0);
- double corr_xysum;
+ double corr_xysum=0;
SortSupportData ssup;
ScalarItem *values;
int values_cnt = 0;
@@ -2352,10 +2692,17 @@ compute_scalar_stats(VacAttrStatsP stats,
dups_cnt;
int slot_idx = 0;
CompareScalarsContext cxt;
+ float4 *corrs;
/* Sort the collected values */
cxt.ssup = &ssup;
cxt.tupnoLink = tupnoLink;
+
+ /*
+ * tuples were previously sorted by TID; now, sort by heap
+ * value, as needed for stats computations, and, for order
+ * relative to original sort for correlation computation.
+ */
qsort_arg((void *) values, values_cnt, sizeof(ScalarItem),
compare_scalars, (void *) &cxt);
@@ -2378,7 +2725,6 @@ compute_scalar_stats(VacAttrStatsP stats,
* is the last item of its group of duplicates (since the group
will
* be ordered by tupno).
*/
- corr_xysum = 0;
ndistinct = 0;
nmultiple = 0;
dups_cnt = 0;
@@ -2566,11 +2912,11 @@ compute_scalar_stats(VacAttrStatsP stats,
}
}
+ Datum *mcv_values=NULL;
/* Generate MCV slot entry */
if (num_mcv > 0)
{
MemoryContext old_context;
- Datum *mcv_values;
float4 *mcv_freqs;
/* Must copy the target values into anl_context */
@@ -2713,37 +3059,42 @@ compute_scalar_stats(VacAttrStatsP stats,
slot_idx++;
}
- /* Generate a correlation entry if there are multiple values */
- if (values_cnt > 1)
- {
+ /* Will generate a correlation entry if there are multiple
values */
+ if (values_cnt>1) {
MemoryContext old_context;
- float4 *corrs;
- double corr_xsum,
- corr_x2sum;
-
- /* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
+ /* Must copy the target values into anl_context */
corrs = (float4 *) palloc(sizeof(float4));
MemoryContextSwitchTo(old_context);
+ }
- /*----------
- * Since we know the x and y value sets are both
- * 0, 1, ..., values_cnt-1
- * we have sum(x) = sum(y) =
- * (values_cnt-1)*values_cnt / 2
- * and sum(x^2) = sum(y^2) =
- *
(values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
- *----------
- */
- corr_xsum = ((double) (values_cnt - 1)) *
- ((double) values_cnt) / 2.0;
- corr_x2sum = ((double) (values_cnt - 1)) *
- ((double) values_cnt) * (double) (2 *
values_cnt - 1) / 6.0;
-
- /* And the correlation coefficient reduces to */
- corrs[0] = (values_cnt * corr_xysum - corr_xsum *
corr_xsum) /
- (values_cnt * corr_x2sum - corr_xsum *
corr_xsum);
+ if (values_cnt>1 && fetchfunc==ind_fetch_func) { /* &&
num_mcv>0) { */
+ /* Compute alternate/fine-grained correlation if there
are MCV list with repeated values... */
+ // elog(WARNING, "%s %s %d value0:%lu", __FILE__,
__FUNCTION__, __LINE__, num_mcv?mcv_values[0] : 1 );
+ corrs[0] = idx_correlation (stats->attr->attrelid,
stats, samplerows, num_mcv, mcv_values);
+ elog(WARNING, "%s %s %d cors %lf", __FILE__,
__FUNCTION__, __LINE__, corrs[0]);
+ stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
+ stats->staop[slot_idx] = mystats->ltopr;
+ stats->stanumbers[slot_idx] = corrs;
+ stats->numnumbers[slot_idx] = 1;
+ slot_idx++;
+ }
+ else if (values_cnt>1) // XXX && fetchfunc==ind_fetch_func? //
correlation is now strictly a per-index attribute and not an per-column one
+ {
+ double fudge=1;
+
+ if (fetchfunc==ind_fetch_func) {
+ /* Compute correlation fudge factor for indices
with
+ * high number of duplicate values for an index
column,
+ * causing index scan to be highly random, due
to btree
+ * random insertion logic being used to avoid
O(N^2)
+ * insertion behavior in that case.
+ */
+ fudge =
1-idx_corr_fudge(RelationIdGetRelation(stats->attr->attrelid), samplerows);
+ }
+ corrs[0] = correlation(values_cnt, corr_xysum);
+ // XXX: corrs[0] *= fudge;
stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
stats->staop[slot_idx] = mystats->ltopr;
stats->stanumbers[slot_idx] = corrs;
diff --git a/src/backend/optimizer/path/costsize.c
b/src/backend/optimizer/path/costsize.c
index eb653cf..4529725 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -693,8 +693,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double
loop_count,
/*
* Now interpolate based on estimated index order correlation to get
total
* disk I/O cost for main table accesses.
+ * Note the sign of this expression: as csquared approaches 0, the
+        * estimate approaches the max_IO_cost estimate.
*/
csquared = indexCorrelation * indexCorrelation;
+ elog(WARNING, "HERE 1222: csquared=%f minIO/R-P-C=%f maxIO/R-P-C=%f %s
%s %d",
+ csquared, min_IO_cost/spc_random_page_cost,
max_IO_cost/spc_random_page_cost,
+ __FILE__, __FUNCTION__, __LINE__);
run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e103f5e..f097580 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6874,9 +6874,34 @@ btcostestimate(PlannerInfo *root, IndexPath *path,
double loop_count,
*/
MemSet(&vardata, 0, sizeof(vardata));
- if (index->indexkeys[0] != 0)
+ /* Expression --- maybe there are stats for the index itself */
+ relid = index->indexoid;
+ colnum = 1;
+ if (get_index_stats_hook &&
+ (*get_index_stats_hook) (root, relid, colnum, &vardata))
+ {
+ /*
+ * The hook took control of acquiring a stats tuple. If it did
+ * supply a tuple, it'd better have supplied a freefunc.
+ */
+ elog(WARNING, "HERE 1223: indexCorrelation %s %s %d", __FILE__,
__FUNCTION__, __LINE__);
+
+ if (HeapTupleIsValid(vardata.statsTuple) &&
+ !vardata.freefunc)
+ elog(ERROR, "no function provided to release variable
stats with");
+ }
+ else if ( NULL != (vardata.statsTuple = SearchSysCache3(STATRELATTINH,
+
ObjectIdGetDatum(relid),
+
Int16GetDatum(colnum),
+
BoolGetDatum(false))))
+ {
+ elog(WARNING, "HERE 1224: indexCorrelation %s %s %d", __FILE__,
__FUNCTION__, __LINE__);
+ vardata.freefunc = ReleaseSysCache;
+ }
+ else if (index->indexkeys[0] != 0)
{
/* Simple variable --- look to stats for the underlying table */
+ elog(WARNING, "HERE 1225: indexCorrelation %s %s %d", __FILE__,
__FUNCTION__, __LINE__);
RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
Assert(rte->rtekind == RTE_RELATION);
@@ -6904,32 +6929,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path,
double loop_count,
vardata.freefunc = ReleaseSysCache;
}
}
- else
- {
- /* Expression --- maybe there are stats for the index itself */
- relid = index->indexoid;
- colnum = 1;
-
- if (get_index_stats_hook &&
- (*get_index_stats_hook) (root, relid, colnum, &vardata))
- {
- /*
- * The hook took control of acquiring a stats tuple.
If it did
- * supply a tuple, it'd better have supplied a freefunc.
- */
- if (HeapTupleIsValid(vardata.statsTuple) &&
- !vardata.freefunc)
- elog(ERROR, "no function provided to release
variable stats with");
- }
- else
- {
- vardata.statsTuple = SearchSysCache3(STATRELATTINH,
-
ObjectIdGetDatum(relid),
-
Int16GetDatum(colnum),
-
BoolGetDatum(false));
- vardata.freefunc = ReleaseSysCache;
- }
- }
if (HeapTupleIsValid(vardata.statsTuple))
{
--