[EMAIL PROTECTED] wrote:

In this case, the behavior observed could be changed by altering the sample size for a table. I submit that an arbitrary fixed sample size is not a good basis for the analyzer; rather, the sample size should be based on the size of the table or some calculation of its deviation.


Mark,

Do you have any evidence that the sample size had anything to do
with the performance problem you're seeing?
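
(If you want to rule that out directly, you can raise the
per-column statistics target and re-ANALYZE; the table and
column names below are only placeholders for whatever you
loaded the TIGER data into:

    -- placeholder names; substitute your own table/column
    ALTER TABLE tiger_zip ALTER COLUMN zipcode SET STATISTICS 1000;
    ANALYZE tiger_zip;

Then re-run EXPLAIN ANALYZE on the slow query and see whether
the plan changes.)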

I also do a lot with the complete Census/TIGER database.

Every problem I have with the optimizer comes down to the
fact that the data is loaded (and ordered on disk) by
State/County FIPS codes, and then queried by zip code
or by city name.  Like this:

    Alabama    36101 [hundreds of pages with zip's in 36***]
    Alaska     99686 [hundreds of pages with zip's in 9****]
    Arizona    85701 [hundreds of pages with zip's in 855**]

Note that the zip codes are *NOT* sequential.

The "correlation" statistic sees that the Zip codes are not
sequential; so it makes the *HORRIBLE* assumption that they
are scattered randomly across the disk.

In reality, even though there's no total ordering of the
zip codes, any given zip code only exists on a couple of
disk pages, so an index scan would be the right choice.


But the single correlation parameter is not sufficient
to let the optimizer know this.
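
You can check the value ANALYZE computed for yourself; the
table and column names here are just placeholders for the
TIGER tables:

    -- placeholder names; substitute the real table/column
    SELECT attname, n_distinct, correlation
      FROM pg_stats
     WHERE tablename = 'tiger_zip'
       AND attname = 'zipcode';

A correlation near zero is exactly what pushes the planner
toward a sequential scan.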

No matter how large a sample size you choose, ANALYZE
will correctly see that Zip codes and State FIPS codes
are non-correlated, and the optimizer will overestimate
the # of pages an index scan will need.
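
If you want to see how big that overestimate is on your own
data, you can force the index scan and compare the planner's
estimate to what actually happens (the query below is only an
example; substitute a real table and zip code):

    -- example only; substitute your own table and zip code
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
        SELECT * FROM tiger_zip WHERE zipcode = '36101';
    RESET enable_seqscan;

The estimated cost will be much higher than the runtime the
index scan actually turns in.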

  Ron

PS: I pointed out workarounds in my earlier posting
in this thread.  Yes, I'm using the same TIGER data
you are.


