# Re: [HACKERS] WIP: cross column correlation ...

On Wed, Feb 23, 2011 at 8:09 PM, Josh Berkus <j...@agliodbs.com> wrote:
>> Personally, I think the first thing we ought to do is add a real, bona
>> fide planner hint to override the selectivity calculation manually,
>> maybe something like this:
>>
>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>>
>> Then, having provided a method for the DBA to extinguish the raging
>> flames of searing agony which are consuming them while a crocodile
>> chews off their leg and their boss asks them why they didn't use
>> Oracle, we can continue bikeshedding about the best way of fixing this
>> problem in a more user-transparent fashion.
>
> Is there some way we can do that without adding the selectivity hint to
> the query itself?  That's the biggest issue with hints.

I've been mulling this issue over a bit more - Nathan Boley raised a
similar point upthread.  I think it's useful to consider some concrete
cases which can occur.

1. Default estimate.  The planner tends to estimate that the
selectivity of <something> = <something> is 0.005, and that the
selectivity of <something> != <something> is 0.995, when it doesn't
know any better.  This estimate often sucks.  Sometimes it sucks
because it's too high, other times because it's too low, and of course
sometimes it is close enough for government work.
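The arithmetic is easy to sketch. The table size and true match count below are invented; 0.005 and 0.995 are the defaults described above:

```python
# Back-of-the-envelope only: table size and true match count are made
# up; 0.005 and 0.995 are the default selectivities cited above.
total_rows = 1_000_000

est_eq  = total_rows * 0.005   # default guess for col = <something>
est_neq = total_rows * 0.995   # default guess for col != <something>

# If the predicate actually matches half the table, the guess is off 100x:
actual = 500_000
print(est_eq, actual / est_eq)   # 5000.0 100.0
```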

2. One special customer.  Suppose we have a database that contains
lots and lots of people and associates different attributes to those
people, including customer_id.  We put all of our employees in the
table too, and assign them customer_id = 1, since the record with
customer.id = 1 represents us.  I've built this kind of system for
several different employers over the years.  Turns out, the subset of
the person table with customer_id = 1 looks very different, in terms
of the MCVs on the remaining columns and the distribution of the
values otherwise, than the records with customer_id != 1.  I'm sure
this problem comes up in different forms in other domains; this is
just where I've seen it the most.

3. The mostly-redundant condition.  Something like creation_date >
'some timestamp' AND active.  Turns out, most of the not active stuff
is also... old.  A variant of this is creation_date > 'some timestamp'
AND customer_id = 1, which overlaps #2.  For extra fun the creation
date and customer_id may be in different tables, with some
intermediate join muddying the waters.

4. The condition that's redundant except when it isn't.  The classic
example here is WHERE zipcode = <constant> AND state = <constant>.
Most of the time, the selectivity of the two clauses together is much
higher than the product of their individual selectivities; you might
as well ignore the second part altogether.  But if some numbskull user
enters a state that doesn't match the zipcode, then suddenly it
matters a lot - the selectivity drops to zero as soon as the second
part is applied.
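The misestimate from the independence assumption can be sketched numerically; the row count and the 40,000-zipcode / 50-state figures below are invented, with uniform distributions assumed for simplicity:

```python
rows = 1_000_000
n_zipcodes, n_states = 40_000, 50   # invented, uniform for simplicity

# Independence assumption: multiply the per-clause selectivities.
independent_est = rows * (1 / n_zipcodes) * (1 / n_states)   # ~0.5 rows

# Reality when the state agrees with the zipcode: the state clause is
# redundant, so the combined selectivity is just the zipcode's.
consistent_actual = rows / n_zipcodes                        # 25.0 rows

# Reality when the user enters a contradictory state: nothing matches.
inconsistent_actual = 0

print(independent_est, consistent_actual, inconsistent_actual)
```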

5. The bitfield.  Conditions like (x & 64) != 0.  I know disk is
cheap, but people keep doing this.
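For uniformly distributed values, the true selectivity of a single-bit test is easy to compute exactly, which makes the gap from the default != estimate concrete (a toy check; 0.995 is the default from case #1):

```python
# Toy check: over values 0..255, (x & 64) != 0 holds for exactly half,
# since bit 6 is set in 128 of the 256 values.
matches = sum(1 for x in range(256) if (x & 64) != 0)
true_sel = matches / 256
print(true_sel)           # 0.5
print(0.995 / true_sel)   # ~2x: the default overestimates badly
```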

There are probably some others I'm missing, too.  That's just off the
top of my head.  Now here are some possible approaches to fixing it:

A. Decorate the query.  This would often be useful for case #1, and
some instances of #3 and #5.  It's useless for #2 and #4.

B. Specify a particular predicate and the selectivity thereof.  Like,
whenever you see (x & 64) = 0, assume the selectivity is 0.5.  Upon
reflection, this seems pretty terrible in every respect.  Unless you
only ever issue an extremely limited range of queries, you're going to
be hardwiring a lot of selectivities.  I think this really only
handles case #5 well, and maybe some instances of case #1.

C. Specify an expression and gather statistics on it as if it were a
column: i.e. ALTER TABLE tab ADD VIRTUAL STATISTICS COLUMN x & 64.
This is pretty good.  It is pretty much ideal for #2 and also handles
#5 and some cases of #3 and #1 well.  You could even make it handle
some instances of #4 if you made the virtual column ROW(state,
zipcode) and rewrote the query as a row comparison.
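A miniature model of what such virtual-column statistics would buy you (names and numbers invented; the point is just that the expression is computed at ANALYZE time and frequencies are gathered on the results):

```python
from collections import Counter

# Pretend sampled column values; 0..9999 stands in for an x column.
xs = range(10_000)

# "ANALYZE" the virtual column x & 64: compute it, collect an MCV list.
virtual = [x & 64 for x in xs]
mcv = {v: c / len(virtual) for v, c in Counter(virtual).most_common()}

# Estimating WHERE (x & 64) = 0 from the gathered frequencies:
print(mcv.get(0, 0.0))   # 0.5008, instead of the 0.005 default
```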

D. N x N implicativeness matrix.  Record for each pair of attributes
the extent to which a given value for A implies a value for B, and
derate the selectivity multipliers based on this information.  This is
an idea of Heikki's.  It seemed good to me when he proposed it, and I
think he proposed it in regards to #4, but I'm not sure we really ever
figured out how to make it work.

E. Given a set of columns (A1, ..., An), collect MCVs and make a
histogram for ROW(A1, ..., An), and then use it to handle cases like
#4.  This is similar to C and is intended to handle the zipcode
problem, but it's not as flexible (because you are only specifying
columns, not expressions).  However, it's intended to work without
rewriting the state/zipcode comparisons as a rowcompare.
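As a toy sketch of how such multi-column MCVs could answer the zipcode case (the data is invented for illustration):

```python
from collections import Counter

# Toy table: zipcode 94110 only ever occurs with state 'CA'.
rows = [("CA", "94110")] * 25 + [("OR", "97201")] * 25 + [("NY", "10001")] * 50

# Gather an MCV list over ROW(state, zipcode).
n = len(rows)
joint_mcv = {pair: c / n for pair, c in Counter(rows).most_common(10)}

# Consistent pair: use the joint frequency directly (0.25), rather than
# the product of the per-column frequencies (0.25 * 0.25 = 0.0625).
print(joint_mcv[("CA", "94110")])

# Contradictory pair: never seen together, so estimate ~0.
print(joint_mcv.get(("NY", "94110"), 0.0))
```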

If you want to take the above as in any way an exhaustive survey of
the landscape (which it isn't), C seems like a standout, maybe
augmented by making the planner able to notice that A1 = x1 AND A2
= x2 is equivalent to (A1, A2) = (x1, x2), so you don't have to rewrite
queries as much.

I don't really know how to handle the join selectivity problem.  I am
not convinced that there is a better solution to that than decorating
the query.  After all, the join selectivity depends not only on the
join clause itself, but also on what you've filtered out of each table
in the meantime.

Note that I am not sure whether any of this is similar to what the WIP
patch already implements, so apologies for possibly rampaging off in a
different direction and/or reinventing your ideas.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers