On Jul 14, 2010, at 12:40 PM, Heikki Linnakangas wrote:
> On 14/07/10 13:12, PostgreSQL - Hans-Jürgen Schönig wrote:
>> hello everybody,
>>
>> we are currently facing some serious issues with cross correlation issue.
>> consider: 10% of all people have breast cancer. we have 2 genders (50:50).
>> if i select all the men with breast cancer, i will get basically nobody -
>> the planner will overestimate the output.
>> this is the commonly known problem ...
>>
>> this cross correlation problem can be quite nasty in many many cases.
>> underestimated nested loops can turn joins into a never ending nightmare and
>> so on and so on.
>>
>> my ideas is the following:
>> what if we allow users to specifiy cross-column combinations where we keep
>> separate stats?
>> maybe somehow like this ...
>>
>> ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)
>>
>> or ...
>>
>> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 =
>> y.id2)
>>
>> clearly we cannot store correlation for all combinations of all columns so
>> we somehow have to limit it.
>>
>> what is the general feeling about something like that?
>
> +1 is my general feeling, it's good if you can tell the system to collect
> additional statistics where needed. And once you have that, you can write an
> agent or something to detect automatically which extra statistics might be
> useful.
>
it seems i can leave my bunker where i was hiding for cover when i was waiting
for a reply ;).
yes, my idea was to have an agent as well - but this is just some follow up
problem.
> However, the problem is how to represent and store the cross-correlation. For
> fields with low cardinality, like "gender" and boolean "breast-cancer-or-not"
> you can count the prevalence of all the different combinations, but that
> doesn't scale. Another often cited example is zip code + street address.
> There's clearly a strong correlation between them, but how do you represent
> that?
we could play the same story with a table storing people including their home
country and the color of their skin.
obviously we will have more black people in african countries..
>
> For scalar values we currently store a histogram. I suppose we could create a
> 2D histogram for two columns, but that doesn't actually help with the zip
> code + street address problem.
>
i think we might go for a second relation here specifically for this issue and
a boolean flag in the current stats table indicating that additional
correlation stats exist (to avoid an additional lookup unless really necessary).
do you have a useful syntax in mind? the thing is: this issue can be isolated
inside a table (e.g. WHERE a.id = a.id2 AND a.id3 = a.id4) or it might span two
tables with an arbitrary number of fields.
many thanks,
hans
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers