On 2019-09-12 12:54:55 -0400, Tom Lane wrote:
> "Peter J. Holzer" <[email protected]> writes:
> > we'll consider just three columns, which we unimaginatively call a, b,
> > and c. There are also three indexes:
>
> > t_a_idx btree (a) WHERE a IS NOT NULL
> > t_b_idx btree (b) WHERE b IS NOT NULL
> > t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL
>
> > Now I have a query
> > select c from t where a='A' and b='B';
Just in case somebody wants to play along at home: a narrow table with
just these three columns and synthetic data shows the problem, but it
doesn't use the flipped multicolumn index either. I guess you have to
pad the table a bit to change the costs sufficiently; a sketch of such a
setup follows below.
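
For reference, a setup along these lines should reproduce it (the pad
column, row count, and value distribution are my guesses, not the
original data; the indexes match the ones quoted above):

    create table t (
        a text not null,
        b text,
        c text,
        pad text default repeat('x', 200)  -- filler, only here to widen the rows
    );

    insert into t (a, b, c)
    select chr(65 + i % 10),                       -- 10 values 'A'..'J' of similar frequency, no NULLs
           case when random() < 0.01 then 'B' end, -- a single non-NULL value, ~1 % frequency
           'payload'
    from generate_series(1, 1000000) as i;

    create index t_a_idx   on t (a)    where a is not null;
    create index t_b_idx   on t (b)    where b is not null;
    create index t_a_b_idx on t (a, b) where a is not null and b is not null;

    analyze t;
    explain select c from t where a = 'A' and b = 'B';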
> > This uses t_b_idx, not - as I expected - t_a_b_idx.
>
> > The distribution of values in columns a and b is quite different: a has
> > 10 different values of similar frequency (and no null values). b has
> > only a single non-null value with a frequency of about 1 %.
>
> > So I definitely understand why it would prefer t_b_idx to t_a_idx, but
> > certainly t_a_b_idx should be even better?
>
> Not necessarily --- t_a_b_idx is (presumably) physically bigger than
> t_b_idx, which makes it more expensive to search. The additional
> selectivity gain apparently doesn't outweigh that.
The index is about 20 % larger than t_b_idx, but only 1/10 of it has to
be searched, so a search should touch only about 1.2/10 = 12 % as much
index data, i.e. take roughly 88 % less time.
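
For anyone checking the size part of that argument: the relative index
sizes are easy to measure (index names as above):

    select relname, pg_size_pretty(pg_relation_size(oid))
    from pg_class
    where relname in ('t_a_idx', 't_b_idx', 't_a_b_idx');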
> > If I create an index with the columns swapped:
> > t_b_a_idx btree (b, a) WHERE b IS NOT NULL and a IS NOT NULL
> > this index will be used.
>
> Hmm. Probably that has something to do with a calculation about
> the selectivity of the leading index column, ie do you have to
> scan 10% of the index or 1% of the index.
Does it use only the selectivity of the first column? That would explain
it. The combined selectivity of both columns is obviously the same
whichever order they are in.
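
The per-column statistics the planner works from are easy to inspect,
e.g.:

    -- null_frac and n_distinct drive the per-column selectivity estimates
    select attname, null_frac, n_distinct
    from pg_stats
    where tablename = 't' and attname in ('a', 'b');

With the distribution described above I'd expect roughly null_frac = 0
and n_distinct = 10 for a, and null_frac = 0.99 and n_distinct = 1 for b.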
> It's not taking the partial-index filter into account in that, I
> suspect, which skews the results in this case --- but that would be
> hard to account for accurately.
Hmm. Wouldn't that be a problem for partial indexes in general? They
usually cover only a small portion of the table, and if the selectivity
is computed relative to the whole table, the result may be way off.
I think it should be possible to adjust for a "WHERE column IS NOT NULL"
filter, because null_frac is in the statistics. For the general case you
would need an estimate of the number of rows covered by the index, which
I don't think we have.
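
To put numbers on the IS NOT NULL case (arithmetic based on the
distribution above):

    selectivity of b = 'B' relative to the table:  ~0.01
    fraction of the table covered by t_b_idx:      1 - null_frac(b) = ~0.01
    selectivity of b = 'B' relative to the index:  0.01 / 0.01 = ~1.0

So relative to the partial index, b = 'B' matches essentially every
entry, not 1 % of them, which is exactly the kind of skew suspected
above.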
> Anyway I can't get excited about optimizing for a single non-null
> value.
That's certainly an extreme case, but I've seen bad plans for similar
queries every now and then. At least now I have an idea of what causes
it, and maybe a solution or workaround.
hp
--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | [email protected] | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>