On Wed, Feb 23, 2011 at 8:09 PM, Josh Berkus <j...@agliodbs.com> wrote:
>> Personally, I think the first thing we ought to do is add a real, bona
>> fide planner hint to override the selectivity calculation manually,
>> maybe something like this:
>>
>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>>
>> Then, having provided a method for the DBA to extinguish the raging
>> flames of searing agony which are consuming them while a crocodile
>> chews off their leg and their boss asks them why they didn't use
>> Oracle, we can continue bikeshedding about the best way of fixing this
>> problem in a more user-transparent fashion.
>
> Is there some way we can do that without adding the selectivity hint to
> the query itself?  That's the biggest issue with hints.
I've been mulling this issue over a bit more - Nathan Boley raised a
similar point upthread. I think it's useful to consider some concrete
cases which can occur.

1. Default estimate. The planner tends to estimate that the selectivity
of <something> = <something> is 0.005, and that the selectivity of
<something> != <something> is 0.995, when it doesn't know any better.
This estimate often sucks. Sometimes it sucks because it's too high,
other times because it's too low, and of course sometimes it is close
enough for government work.

2. One special customer. Suppose we have a database that contains lots
and lots of people and associates various attributes with those people,
including customer_id. We put all of our employees in the table too,
and assign them customer_id = 1, since the record with customer.id = 1
represents us. I've built this kind of system for several different
employers over the years. It turns out that the subset of the person
table with customer_id = 1 looks very different, in terms of the MCVs
on the remaining columns and the distribution of the values otherwise,
from the records with customer_id != 1. I'm sure this problem comes up
in different forms in other domains; this is just where I've seen it
the most.

3. The mostly-redundant condition. Something like creation_date > 'some
timestamp' AND active. It turns out that most of the not-active stuff
is also... old. A variant of this is creation_date > 'some timestamp'
AND customer_id = 1, which overlaps #2. For extra fun, the creation
date and customer_id may be in different tables, with some intermediate
join muddying the waters.

4. The condition that's redundant except when it isn't. The classic
example here is WHERE zipcode = <constant> AND state = <constant>. Most
of the time, the selectivity of the two clauses together is much higher
than the product of their individual selectivities; you might as well
ignore the second part altogether.
But if some numbskull user enters a state that doesn't match the
zipcode, then suddenly it matters a lot - the selectivity drops to zero
when the second part is added.

5. The bitfield. Conditions like (x & 64) != 0. I know disk is cheap,
but people keep doing this.

There are probably some others I'm missing, too. That's just off the
top of my head. Now here are some possible approaches to fixing it:

A. Decorate the query. This would often be useful for case #1, and some
instances of #3 and #5. It's useless for #2 and #4.

B. Specify a particular predicate and the selectivity thereof. Like,
whenever you see (x & 64) = 0, assume the selectivity is 0.5. Upon
reflection, this seems pretty terrible in every respect. Unless you
only ever issue an extremely limited range of queries, you're going to
be hardwiring a lot of selectivities. I think this really only handles
case #5 well, and maybe some instances of case #1.

C. Specify an expression and gather statistics on it as if it were a
column: i.e. ALTER TABLE tab ADD VIRTUAL STATISTICS COLUMN x & 64.
This is pretty good. It is pretty much ideal for #2 and also handles #5
and some cases of #3 and #1 well. You could even make it handle some
instances of #4 if you made the virtual column ROW(state, zipcode) and
rewrote the query as a row comparison.

D. N x N implicativeness matrix. Record, for each pair of attributes,
the extent to which a given value for A implies a value for B, and
derate the selectivity multipliers based on this information. This is
an idea of Heikki's. It seemed good to me when he proposed it, and I
think he proposed it in regard to #4, but I'm not sure we ever really
figured out how to make it work.

E. Given a set of columns (A1, ..., An), collect MCVs and make a
histogram for ROW(A1, ..., An), and then use it to handle cases like
#4. This is similar to C and is intended to handle the zipcode problem,
but it's not as flexible (because you are only specifying columns, not
expressions).
However, it's intended to work without rewriting the state/zipcode
comparisons as a rowcompare.

If you want to take the above as in any way an exhaustive survey of the
landscape (which it isn't), C seems like a standout, maybe augmented by
making the planner able to notice that A1 = x1 AND A2 = x2 is
equivalent to (A1, A2) = (x1, x2), so you don't have to rewrite queries
as much.

I don't really know how to handle the join selectivity problem. I am
not convinced that there is a better solution to that than decorating
the query. After all, the join selectivity depends not only on the join
clause itself, but also on what you've filtered out of each table in
the meantime.

Note that I am not sure whether any of this is similar to what the WIP
patch already implements, so apologies for possibly rampaging off in a
different direction and/or reinventing your ideas.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company