Simon Riggs <[EMAIL PROTECTED]> writes:
>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.
There's been some previous discussion about how "correlation" was not really what we wanted to be measuring, but that discussion was in regard to cross-column "correlation". In that case we're trying to predict how selective a clause will be: if we read x% of the table due to a restriction on X, what percentage of the values of Y will be represented?

In this case I think we do need to know correlation or something like it, because what we're trying to predict is how close to sequential the I/O accesses will be. If there's no correlation between index order and disk order then they'll be random; if they're highly correlated then accesses will be close to sequential.

It's possible there's some sort of "block-wise correlated" measure which would be even better for our needs. We don't care if all the high values are towards the start and the low values towards the end, as long as each section is in order, for example.

It's also possible that we could use something like what you describe to predict how many physical I/Os will happen altogether. If the table is highly clustered but disordered then the I/O will be random access, but the cache will be more effective than if the table were highly correlated but not clustered (though it would take a large table to make that possible, I think).

In short, I think what's needed is someone to review a lot of different stats metrics for correlation and clustering and do some analysis of how each would be useful for cost modelling.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
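To make the distinction concrete, here is a small sketch (the function names and the "block-wise" metric are invented for illustration, not anything PostgreSQL implements). The first function computes a plain linear correlation between index position and heap block number, in the spirit of the existing indexCorrelation statistic; the second is one possible "block-wise" measure: the fraction of steps through the index that stay on the same heap block or advance to the next one. A table whose halves are each internally sequential but stored in swapped order shows why the two can disagree.

```python
def correlation(heap_blocks):
    """Pearson correlation between index position and heap block number,
    roughly in the spirit of PostgreSQL's indexCorrelation statistic."""
    n = len(heap_blocks)
    xs = range(n)
    mx = (n - 1) / 2.0
    my = sum(heap_blocks) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, heap_blocks))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in heap_blocks)
    return cov / (vx * vy) ** 0.5

def blockwise_sequentiality(heap_blocks):
    """Hypothetical 'block-wise' measure: the fraction of consecutive
    index entries whose heap blocks are the same or adjacent, i.e. how
    close an index scan's I/O pattern is to sequential."""
    steps = list(zip(heap_blocks, heap_blocks[1:]))
    good = sum(1 for a, b in steps if 0 <= b - a <= 1)
    return good / len(steps)

# Each half of the table is in order, but the halves are swapped:
# high values at the start, low values at the end.
blocks = list(range(50, 100)) + list(range(0, 50))
print(correlation(blocks))             # about -0.5: looks poorly correlated
print(blockwise_sequentiality(blocks)) # 98/99: almost every step is sequential
```

The plain correlation statistic would make this table look half random, yet all but one of the page transitions during an index scan are sequential, which is the case the "block-wise" idea is meant to capture.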