On 06/11/14 23:15, Katharina Büchse wrote:
Hi,
I'm a phd-student at the university of Jena, Thüringen, Germany, in
the field of data bases, more accurate query optimization.
I want to implement a system in PostgreSQL that detects column
correlations and creates statistical data about correlated columns for
the optimizer. Therefore I need to store two dimensional statistics
(especially two dimensional histograms) in PostgreSQL.
I had a look at the description of "WIP: multivariate statistics /
proof of concept", which looks really promising, I guess these
statistics are based on scans of the data? For my system I need both
-- statistical data based on table scans (actually, samples are
enough) and those based on query feedback. Query feedback (tuple
counts and, speaking a little inaccurately, the where-part of the
query itself) needs to be extracted and there needs to be a decision
for the optimizer, when to take multivariate statistics and when to
use the one dimensional ones. Oracle in this case just disables one
dimensional histograms if there is already a multidimensional
histogram, but this is not always useful, especially in the case of a
feedback based histogram (which might not cover the whole data space).
I want to use both kinds of histograms because correlations might
occur only in parts of the data. In this case a histogram based on a
sample of the whole table might not get the point and wouldn't help
for the part of the data the user seems to be interested in.
There are special data structures for storing multidimensional
histograms based on feedback and I already tried to implement one of
these in C. In the case of two dimensions they are of course not "for
free" (one dimensional would be much cheaper), but based on the
principle of maximum entropy they deliver really good results. I
decided for only two dimensions because in this case we have the best
proportion of cost and benefit when searching for correlation (here
I'm relying on tests that were made in DB2 within a project called
CORDS which detects correlations even between different tables).
I'd be grateful for any advices and discussions.
Regards,
Katharina
Could you store a 2 dimensional histogram in a one dimensional array:
A[z] = value, where z = col * rowSize + row (zero starting index)?
Cheers,
Gavin
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers