Re: Statistical sampling and representative stats collection

Tim Gorman Fri, 24 May 2002 05:09:48 -0700

Fantastic research, John!

----- Original Message -----
To: "Multiple recipients of list ORACLE-L" <[EMAIL PROTECTED]>
Sent: Thursday, May 23, 2002 7:13 PM



> Jack,
>
> I conducted some tests of ANALYZE ESTIMATE vs COMPUTE for my IOUG paper
and
> arrived at the following: (cut-and-paste of relevant parts of the paper)
>
> --- Begin Quote ---
> MYTH:  "COMPUTE IS BETTER THAN ESTIMATE"
> This one generates an endless debate actually, so we will not take a firm
> stand either way. Rather, we will present some figures that throw some
light
> on the issue and allow us to step back and look at the situation. The
> problem with COMPUTE is that it has to scan the entire table, sort it and
> figure out the exact data distribution. On the other hand, ESTIMATE steps
> through samples of the data, sorts and analyzes only a portion of the
data.
>
> In a recent test for the effectiveness of COMPUTE versus ESTIMATE on a
> static clone of a reasonably large Oracle Apps database, the statistics
were
> generated and stored for both COMPUTE and ESTIMATE. The Database consisted
> of about 3,300 tables and 6,000 indexes and occupied approximately 120 Gb.
> The ESTIMATE percentage was defaulted to 10% and no activity other than
> ANALYZE was allowed on this clone during the entire period. Table
statistics
> including row count, average row length and blocks occupied were analyzed.
> This showed that there were some differences in row count and average row
> length on 321 of these tables. Row count differences ranged from a value
of
> 53 row less in the ESTIMATE of a table containing 205,743 rows (0.025%)
all
> the way up-to a count difference of 101,704 in 13,311,090 rows (0.76%).
Even
> assuming a difference of a maximum of 5% in these scenarios, you are not
far
> off the goal. Further analysis showed that a smaller average row length
> coupled with a small table produced larger variations than was usually
seen.
>
>
> The differences however, were far more pronounced in Indexes - differences
> of upto 300% were noticed. Further analysis showed that this was related
to
> the percentage of deleted leaf rows in the index. If this percentage is
> high, the possibility of ESTIMATE going wrong was also high, as the
> deletions are not factored in correctly. This was especially true if the
> deletions occurred in leaf blocks that were probably not involved in the
> ESTIMATE. When the deleted leaf rows was low or even nil within the index,
> the percentage difference was much lower, in the range of 4 to 5%.
>
> The real myth killer is the cost of COMPUTE versus ESTIMATE - COMPUTE
> required 66,553,308 reads versus 38,951,158 reads for ESTIMATE - almost
70%
> more reads for COMPUTE. The sorting involved in determining the averages
and
> data distribution was a clincher - COMPUTE processed 4,263,724,259 rows in
> sorting operations while ESTIMATE sorted just 18,025,069 - i.e. about 235%
> more rows were sorted for the COMPUTE operation. The last nail in the
coffin
> was the time taken to COMPUTE statistics - about 36 hours against the time
> to ESTIMATE of just 12 hours.
>
> While the figures speak for themselves, we will offer some general advice
to
> the cautious: ESTIMATE on tables and COMPUTE on Indexes. Columns are
> analyzed by default, but serve no useful purpose other than showing data
> spread. Hence, you could ANALYZE only Tables and Indexed columns alone. An
> identified list of 'small' tables could also be COMPUTED rather than
> ANALYZED. This advice is given because ESTIMATE on a table comes close as
> far as row count goes, while COMPUTE on Indexes generates a more accurate
> picture of both data distribution as well as object size statistics.
Testing
> the effectiveness of COMPUTE versus ANALYZE is simple and provides you
with
> figures that you can use to decide the strategy for your situation.
>
> Before we move to the next topic, keep in mind that an ANALYZE/ESTIMATE
with
> a sample size greater than or equal to 50% will result in COMPUTE.
>
> --- End Quote ---
>
> The problem is that this simple mathematical model looks only at object
> sizes and did not look at Column spread and sensitivity. However, I
believe
> that the combination of ESTIMATE on Tables and COMPUTE on Indexes would
> catch most of it.
>
> As always, YMMV!
>
> John Kanagaraj
> Oracle Applications DBA
> DBSoft Inc
> (W): 408-970-7002
>
> The manuals for Oracle are here: http://tahiti.oracle.com
> The manual for Life is here: http://www.gospelcom.net
>
> ** The opinions and statements above are entirely my own and not those of
my
> employer or clients **
>
>
> > -----Original Message-----
> > From: Jack Silvey [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, May 21, 2002 2:44 PM
> > To: Multiple recipients of list ORACLE-L
> > Subject: Statistical sampling and representative stats collection
> >
> >
> > Hi all,
> >
> > Did some investigation about statistical sampling this
> > weekend since we are going to optimize our analyze
> > process soon, and would like some input from all you
> > orabrains on this one.
> >
> > I opened a TAR with Oracle asking about the sampling
> > algorithm of stats collection, and they assured me it
> > was random.
> >
> > The goal of analyze...estimate is to collect stats
> > that are representative of the data population as a
> > whole using a given sample set. Since analyzing tables
> > takes up resources (does sorts to order the data for
> > investigation) the fewer rows you use in estimate, the
> > less system resources you use and the faster the
> > analyze will go.
> >
> > Since our goal is to get as small a sample as possible
> > and still have stats that are representative, my
> > contention is that we could start by finding what the
> > margin of error will be for each sample size and gauge
> > our tolerance for it.
> >
> > One standard way to calculate margin of error for a
> > given sample is by using this formula:
> >
> > M = 1/SQRT(N)
> >
> > where:
> > M = margin of error
> > N=sample size
> >
> > So, if we can tolerate stats that have a 1% a margin
> > of error (will deviate from representative of the
> > whole population by 1%), our sample size should be
> > 10,000 rows.
> >
> > Also, a corollary (not a toyota corollary, though) to
> > this would be that the more rows you add to your
> > sample, the closer to representative your sample will
> > be. So, in order to test whether your sample is
> > representative enough, you could analyze using either
> > estimate 49% or compute, take a snapshot of the stats,
> > and then compare the stats from a 10,000 row estimate
> > to those. Then, add rows to your estimate until you
> > are satisfied with the stats.
> >
> > This of course is a pie in the sky mathematical model,
> > but seems like a reasonable place to start with
> > testing.
> >
> > Input? Input? Buhler? Buhler?
> >
> >
> > /jack silvey
> >
> >
> > __________________________________________________
> > Do You Yahoo!?
> > LAUNCH - Your Yahoo! Music Experience
> > http://launch.yahoo.com
> > --
> > Please see the official ORACLE-L FAQ: http://www.orafaq.com
> > --
> > Author: Jack Silvey
> >   INET: [EMAIL PROTECTED]
> >
> > Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
> > San Diego, California        -- Public Internet access / Mailing Lists
> > --------------------------------------------------------------------
> > To REMOVE yourself from this mailing list, send an E-Mail message
> > to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
> > the message BODY, include a line containing: UNSUB ORACLE-L
> > (or the name of mailing list you want to be removed from).  You may
> > also send the HELP command for other information (like subscribing).
> >
> --
> Please see the official ORACLE-L FAQ: http://www.orafaq.com
> --
> Author: John Kanagaraj
>   INET: [EMAIL PROTECTED]
>
> Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
> San Diego, California        -- Public Internet access / Mailing Lists
> --------------------------------------------------------------------
> To REMOVE yourself from this mailing list, send an E-Mail message
> to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
> the message BODY, include a line containing: UNSUB ORACLE-L
> (or the name of mailing list you want to be removed from).  You may
> also send the HELP command for other information (like subscribing).

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Tim Gorman
  INET: [EMAIL PROTECTED]

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Re: Statistical sampling and representative stats collection

Reply via email to