Re: Bitmap scan is undercosted? - boolean correlation

Jeff Janes Wed, 06 Dec 2017 20:52:21 -0800

On Tue, Dec 5, 2017 at 10:50 AM, Tom Lane <[email protected]> wrote:

> Jeff Janes <[email protected]> writes:
> > On Dec 3, 2017 15:31, "Tom Lane" <[email protected]> wrote:
> >> Jeff Janes <[email protected]> writes:
> >>> But I do see that ties within the logical order of the column values
> are
> >>> broken to agree with the physical order.  That is wrong, right?  Is
> there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead?  We'd have to do something
> with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied.  If there are 10 distinct values, number the values 0
> to
> > 9, and all rows of a given distinct values get the same number for the
> > logical order axis.
> > Calling the correlation 0.8 when it is really 0.0 seems obviously wrong
> to
> > me.  Although if we switched btree to store duplicate values with tid as
> a
> > tie breaker, then maybe it wouldn't be as obviously wrong.
>
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.

The correlation is used in another place, estimating how much of the table
we will visit in the first place.  If the correlation is very high, then
scanning 10% of the index leaf pages means we will visit 10% of the table.
If the correlation is low, then we use Mackert and Lohman, and (in the case
of visiting 10% of the index) predict we will visit most of the table.
Assuming effective_cache_size is high, we will visit most of the table just
once, but still in a random order, because subsequent visits for the same
query will be found in the cache.  Rather than visiting the various pages
repeatedly and not finding them in cache each time.

In addition to estimating how much of the table we visit, we also estimate
how "sequential like" those visits are.  Which is the use that you
describe.  Ideally for that use case, we would know for each distinct
value, how correlated the tids are with the leaf page ordering.  If the
index is freshly built, that is very high.  We visit 1/10 of the index,
which causes us to visit 100% of the table but in perfect order, plucking
1/10 of the tuples from each table page.

But visiting 100% of the table in physical order in order to pluck out 10%
of the tuples from each page is quite different than visiting 10% of the
table pages in physical order to pluck out 100% of the tuples from those
pages and 0% from the pages not visited.

...

BTW, I disagree that "correlation = zero" is the right answer for this
> particular example.  If the btree is freshly built, then an index-order
> scan would visit all the heap pages in sequence to fetch "f" rows, and
> then visit them all in sequence again to fetch "t" rows, which is a whole
> lot better than the purely random access that zero correlation implies.
> So I think 0.8 or so is actually a perfectly reasonable answer when the
> index is fresh.  The trouble is just that it'd behoove us to derate that
> answer somewhat for the probability that the index isn't fresh.
>

But, for the case of "how much of the table do we visit at all",
correlation = zero is the right answer, even if it isn't the right answer
for "how sequentially do we visit whatever we visit"

> My first thought for a concrete fix was to use the mean position of
> a group of duplicates for the purposes of the correlation calculation,
> but on reflection that's clearly wrong.  We do have an idea, from the
> data we have, whether the duplicates are close together in the heap
> or spread all over.  Using only mean position would fail to distinguish
> those cases, but really we'd better penalize the spread-all-over case.
> I'm not sure how to do that.
>

Departing from correlations, we could also try to estimate "How many
different table pages does each index leaf page reference".  This could
capture functional dependencies which are strong, but not in the form of
linear correlations.  (The current extended statistics only captures
dependencies between user columns, not between one user column and one
system column such as table slot)

For whatever its worth, here is my "let ties be ties" patch.

It breaks two regression tests due to plan changes, and both are cases
where maybe the plan ought to change for the very reason being discussed.
If I just put random gibberish into the correlation field, more regression
tests fail, so I think my implementation is not too far broken.

The accumulations into corr_ysum and corr_y2sum could trivially be pushed
down into the "if", and corr_xysum could as well with a little algebra.
But that seems like premature optimization for a proof-of-concept patch.

Cheers,

Jeff

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f952b3c..3369784 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2309,6 +2309,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 	bool		is_varwidth = (!stats->attrtype->typbyval &&
 							   stats->attrtype->typlen < 0);
 	double		corr_xysum;
+	double		corr_ysum;
+	double		corr_y2sum;
 	SortSupportData ssup;
 	ScalarItem *values;
 	int			values_cnt = 0;
@@ -2429,6 +2431,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * be ordered by tupno).
 		 */
 		corr_xysum = 0;
+		corr_ysum = 0;
+		corr_y2sum = 0;
 		ndistinct = 0;
 		nmultiple = 0;
 		dups_cnt = 0;
@@ -2436,7 +2440,9 @@ compute_scalar_stats(VacAttrStatsP stats,
 		{
 			int			tupno = values[i].tupno;
 
-			corr_xysum += ((double) i) * ((double) tupno);
+			corr_xysum += ((double) ndistinct) * ((double) tupno);
+			corr_ysum += ((double) ndistinct);
+			corr_y2sum += ((double) ndistinct) * ((double) ndistinct);
 			dups_cnt++;
 			if (tupnoLink[tupno] == tupno)
 			{
@@ -2777,11 +2783,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 			MemoryContextSwitchTo(old_context);
 
 			/*----------
-			 * Since we know the x and y value sets are both
+			 * Since we know the x value sets are
 			 *		0, 1, ..., values_cnt-1
-			 * we have sum(x) = sum(y) =
+			 * we have sum(x) = 
 			 *		(values_cnt-1)*values_cnt / 2
-			 * and sum(x^2) = sum(y^2) =
+			 * and sum(x^2) = 
 			 *		(values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
 			 *----------
 			 */
@@ -2791,8 +2797,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 				((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
 
 			/* And the correlation coefficient reduces to */
-			corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
-				(values_cnt * corr_x2sum - corr_xsum * corr_xsum);
+			corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_ysum) /
+				sqrt((values_cnt * corr_x2sum - corr_xsum) * (values_cnt * corr_y2sum - corr_ysum));
 
 			stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
 			stats->staop[slot_idx] = mystats->ltopr;

Re: Bitmap scan is undercosted? - boolean correlation

Reply via email to