RE: Statistical sampling and representative stats collection

Cherie_Machler Wed, 29 May 2002 05:20:14 -0700


Larry,


Thanks for taking the time to write this lengthy reply.   I've never seen
this particular information on histograms anywhere.

This has been very informative.

Cherie Machler
Oracle DBA
Gelco Information Network


                                                                                       
                           
                    "Larry Elkins"                                                     
                           
                    <elkinsl@flash       To:     <[EMAIL PROTECTED]>                
                           
                    .net>                cc:                                           
                           
                                         Subject:     RE: Statistical sampling and 
representative stats           
                    05/29/02 05:57        collection                                   
                           
                    AM                                                                 
                           
                                                                                       
                           
                                                                                       
                           




> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of John
> Kanagaraj
> Sent: Tuesday, May 28, 2002 1:49 PM
> To: Multiple recipients of list ORACLE-L
> Subject: RE: Statistical sampling and representative stats collection
>
>
> Btw, there is an option in 9i DBMS_STATS.AUTO_SAMPLE_SIZE. Has
> anyone tried
> this out? Or know how it works?

I would be interested in that as well.

> > for these stats on non-indexed columns? If so, I disagree on
> > this point,
> > even if it is general advice and not a rule. Stats on
> > non-indexed columns
> > can play a *large* role in CBO decisions.
>
> I agree with you, although I do have to contend that the effect
> is not very
> pronounced in all databases. This was revealed in some depth in
Wolfgang's
> paper at IOUG where he was able to actually determine this in a
> 10053 trace,
> and it was an eye opener for me. The issue I have with this is that the
> default number of buckets is 2 and that is nowhere near what is needed.
On
> the other hand, indiscriminately increasing the bucket size would
> leave you
> with a _Large_ number of histograms and result in the 'row cache
> lock' latch
> being taken out more that it should have been (as well as add to
> the cost of
> parsing).

I would only consider gathering this info, or customizing the stats
gathering process in general, on a case by case basis and only when needed.
Due to the nature of the data and queries, you might find the need to
gather
the non-indexed column stats on a single table and 1 column, or maybe 2
tables, or maybe many tables and columns, or maybe none at all.

You already know, then, how stats on non-indexed columns can play a role.
But since I received a few back-channel emails asking for examples of how
stats on the non-indexed columns make a difference, I'll include the
following extreme example that illustrates. But by no means am I saying you
should always gather them, just on an as needed basis. And I haven't
included the 10053 traces since that could get very tedious ;-)

Say we have two tables, tables A and B, each consisting of two columns, A
and B. Column A is the PK on each, column B is not indexed. So:

Table A: 50,000 Rows
Table B: 50,000 Rows
Column A.A: (PK) Values 1 thru 50,000
Column B.A: (PK) Values 1 thru 50,000
Column A.B: Not indexed, values 1 thru 25000, each occurring twice ? e.g.
1,1,2,2,3,3...
Column B.B: Not indexed, values 0 and 1, each occurring 25,000 times

Note that I am not really even dealing with any skewness at this point ?
this is an even distribution of values. Here are 4 sample queries:

Sample Queries:

Select *
>From A, B
WHERE A.A = B.A
  and A.B = 5 ?- Filters down to 2 rows, I want to drive by A in an NL
fashion

Select *
>From A, B
WHERE A.A = B.A
  and A.B = 5 ?- Filters down to 2 rows, I still want to drive by A in an
NL
fashion
  and B.B = 1 ?- Filters down to 25,000 rows if treated standalone

Select *
>From A, B
WHERE A.A = B.A
  and B.B = 1 ?- Filters down to 25,000 rows, I want FTS's and HJ.

Select *
>From A, B
WHERE A.A = B.A
  and B.B = 27000 ?- no rows meet this, I want to drive the query by table
B.

With a basic "compute for table for all indexed columns", we get hash joins
with all four, in the case of the first two, driving by table A, and the
next two driving by table B. There were no stats on the non-indexed column.
This isn't what I really wanted for all 4 queries.

Now, if I simply do an "analyze table compute statistics", in which case
stats are gathered for both columns A and B, you get the default bucket and
two rows for each column over in DBA_TAB_HISTOGRAMS.

In the case of the first 2 queries, I get a nested loops driven by table A
using an FTS and an index lookup into table B. This is what I want because
of the really restrictive filtering criteria on table A's non-indexed
column
B, resulting in two rows being selected, and two indexed lookups into table
B. The stats on the non-indexed column helped the CBO make this decision.

In the case of queries 3 and 4, I would get a MERGE JOIN, using a full
index
scan on A's PK to get each row, for the purpose of feeding the data in
sorted order (I might have preferred an FTS and actually doing the sort!).
And an FTS on table B. Note that by increasing the HASH_AREA_SIZE, queries
3
and 4 simply went with FTS's and an HJ, driving by table B. I would have
hoped that in the case of query 4 that the CBO would have chosen to drive
by
table B, using a nested loops index approach into table A. Upon no rows
being returned from table B, table A would not have been accessed at all.

So, we go back and analyze specifying a SIZE for column B. Queries 1 and 2
use a nested loops driving by table A, and the restrictive filtering
criteria result in 2 indexed lookups into table B. This is what I want. In
the case of query 3, and having dropped the hash area size back down, it
does FTS's and an HJ driving by table B. This is what I want.

And in the case of query 4, it uses a nested loops approach, driving by an
FTS on table B, and an indexed lookup into table A. This is what we want.
The criteria on the non-indexed column on table B will filter out all the
rows, meaning table A will not even be accessed. Without the stats on the
non-indexed column on table B, the CBO would have FTS'd both tables doing a
hash join.

Ok, so this is an extreme example, and it is not a one size fits all
approach that should always be applied. This was simply an example case
illustrating how the stats on the non-indexed columns can play a role, and
how we might even need to specify the SIZE to get the desired results
(going
back to your comment about Wolfgang's 10053 examples).

Larry G. Elkins
[EMAIL PROTECTED]
214.954.1781





--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author:
  INET: [EMAIL PROTECTED]

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

RE: Statistical sampling and representative stats collection

Reply via email to