Simon Riggs <[EMAIL PROTECTED]> writes:

>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

There's been some previous discussion about how "correlation" was not really
what we wanted to be measuring. But that discussion was in regards to
cross-column "correlation". In that case we're trying to predict how selective
a clause will be. If we read x% of the table due to a restriction on X what
percentage of the values of Y will be represented?

In this case I think we do need to know correlation or something like it.
That's because what we're trying to predict is how close to sequential the i/o
accesses will be. If there's no correlation between index order and disk order
then they'll be random. If they're highly correlated then accesses will be
close to sequential.

It's possible there's some sort of "block-wise correlated" measure which would
be even better for our needs. We don't care if all the high values are towards
the start and low values towards the end as long as each section is in order,
for example.

It's also possible that we could use something like what you describe to
predict how many physical i/os will happen altogether. If the table is highly
clustered but disordered then the io will be random access but the cache will
be more effective than if the table is highly correlated but not clustered
(though it would take a large table to make that possible I think).

In short I think what's needed is someone to review a lot of different stats
metrics for correlation and clustering and do some analysis of how each would
be useful for cost modelling. 

  Gregory Stark

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?


Reply via email to