On Jul 14, 2010, at 12:40 PM, Heikki Linnakangas wrote:

> On 14/07/10 13:12, PostgreSQL - Hans-Jürgen Schönig wrote:
>> hello everybody,
>> we are currently facing some serious issues with cross correlation issue.
>> consider: 10% of all people have breast cancer. we have 2 genders (50:50).
>> if i select all the men with breast cancer, i will get basically nobody - 
>> the planner will overestimate the output.
>> this is the commonly known problem ...
>> this cross correlation problem can be quite nasty in many many cases.
>> underestimated nested loops can turn joins into a never ending nightmare and 
>> so on and so on.
>> my ideas is the following:
>> what if we allow users to specifiy cross-column combinations where we keep 
>> separate stats?
>> maybe somehow like this ...
>> or ...
>> y.id2)
>> clearly we cannot store correlation for all combinations of all columns so 
>> we somehow have to limit it.
>> what is the general feeling about something like that?
> +1 is my general feeling, it's good if you can tell the system to collect 
> additional statistics where needed. And once you have that, you can write an 
> agent or something to detect automatically which extra statistics might be 
> useful.

it seems i can leave my bunker where i was hiding for cover when i was waiting 
for a reply ;).
yes, my idea was to have an agent as well - but this is just some follow up 

> However, the problem is how to represent and store the cross-correlation. For 
> fields with low cardinality, like "gender" and boolean "breast-cancer-or-not" 
> you can count the prevalence of all the different combinations, but that 
> doesn't scale. Another often cited example is zip code + street address. 
> There's clearly a strong correlation between them, but how do you represent 
> that?

we could play the same story with a table storing people including their home 
country and the color of their skin.
obviously we will have more black people in african countries..

> For scalar values we currently store a histogram. I suppose we could create a 
> 2D histogram for two columns, but that doesn't actually help with the zip 
> code + street address problem.

i think we might go for a second relation here specifically for this issue and 
a boolean flag in the current stats table indicating that additional 
correlation stats exist (to avoid an additional lookup unless really necessary).
do you have a useful syntax in mind? the thing is: this issue can be isolated 
inside a table (e.g. WHERE a.id = a.id2 AND a.id3 = a.id4) or it might span two 
tables with an arbitrary number of fields.

        many thanks,


Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to