Re: [HACKERS] cross column correlation revisted

Yeb Havinga Wed, 14 Jul 2010 04:20:42 -0700

Heikki Linnakangas wrote:

However, the problem is how to represent and store the cross-correlation. For fields with low cardinality, like "gender" and boolean "breast-cancer-or-not" you can count the prevalence of all the different combinations, but that doesn't scale. Another often-cited example is zip code + street address. There's clearly a strong correlation between them, but how do you represent that?
For scalar values we currently store a histogram. I suppose we could create a 2D histogram for two columns, but that doesn't actually help with the zip code + street address problem.
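
To make the low-cardinality case concrete, here is a minimal sketch (not PostgreSQL code) that counts every value combination for two columns and compares the resulting selectivity with what an independence assumption would give. The function name `selectivity_estimates` and the sample rows are made up for illustration.

```python
from collections import Counter


def selectivity_estimates(rows, a, b):
    """Estimate the selectivity of (col0 = a AND col1 = b) two ways.

    rows is a list of (col0, col1) tuples. Returns a pair:
    (estimate assuming the columns are independent,
     estimate taken from the joint combination counts).
    """
    n = len(rows)
    col0_counts = Counter(r[0] for r in rows)   # per-column frequencies
    col1_counts = Counter(r[1] for r in rows)
    joint_counts = Counter(rows)                # one counter per combination

    independent = (col0_counts[a] / n) * (col1_counts[b] / n)
    joint = joint_counts[(a, b)] / n
    return independent, joint


# Hypothetical data: the boolean column is only ever true for "F".
rows = [("F", True)] * 10 + [("F", False)] * 40 + [("M", False)] * 50

independent, joint = selectivity_estimates(rows, "F", True)
print(independent, joint)
```

With this data the independence assumption underestimates the ("F", True) combination by half and predicts matches for ("M", True), which never occurs. The joint counter needs one entry per distinct combination, which is why the approach stops scaling once cardinality grows.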

In my head the neuron for 'principal component analysis' went on while reading this. Back in college it was used to prepare input data before feeding it into a neural network. Maybe ideas from PCA could be helpful?
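
As a rough sketch of what PCA would measure for two numeric columns: take the 2x2 sample covariance matrix and look at how much of the total variance the first principal component explains. A share near 1.0 means the columns are almost linearly dependent; near 0.5 means no linear correlation. The helper name `pca_2d` and the sample values are assumptions for illustration, and this only captures linear relationships between numeric values, so it is unlikely to apply directly to something like zip code + street address.

```python
import math


def pca_2d(xs, ys):
    """PCA on two numeric columns via the 2x2 sample covariance matrix.

    Returns (largest eigenvalue, smallest eigenvalue,
             share of total variance explained by the first component).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    centre = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    largest = centre + spread
    smallest = centre - spread

    total = largest + smallest
    share = largest / total if total > 0 else 0.0
    return largest, smallest, share


# Perfectly correlated columns: the first component carries all the variance.
_, _, share_correlated = pca_2d([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Uncorrelated columns with equal variance: the variance splits evenly.
_, _, share_uncorrelated = pca_2d([1, -1, 1, -1], [1, 1, -1, -1])

print(share_correlated, share_uncorrelated)
```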


regards,
Yeb Havinga



--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
