RE: I/O EVENTS

Jack Silvey Tue, 21 May 2002 21:53:47 -0700

Cary,

This is greatness.


Jack


--- Cary Millsap <[EMAIL PROTECTED]> wrote:
> I have an example for you (Anjo, I hope you won't
> mind). A prospect we
> visited once upon a time had been fighting a
> performance problem with an
> Oracle Payroll program. They "knew" what their
> problem was: very
> clearly, v$system_event was telling them that their
> overwhelmingly
> dominant system performance problem was waits for
> "latch free." (Well,
> no it wasn't. It was waits for the Oracle Payroll
> program. ;) )
> 
> Actually, the v$system_event view indicated clearly
> that the top wait
> event was "SQL*Net message from client" (and some
> other "SQL*Net %"
> stuff), but of course every good DBA knows that you
> have to discard all
> the "SQL*Net" events because they are "idle" events.
> 
> Three months of trying and failing to fix the
> problem led finally to a
> last-resort, desperation hardware upgrade. Twelve
> 700MHz CPUs became
> twelve new 1,024MHz CPUs. (Don't ask why "waits for
> latch free" was
> translated as "need to upgrade CPUs.") And
> unfortunately, the Payroll
> program got *slower*. Not just "seemed
> slower"--*was* slower.
> 
> After the upgrade failed to help, these guys finally
> let us come see
> them. We traced the Payroll process (10046, level
> 8), very carefully
> initiating the trace immediately after the
> initiation of the Payroll
> run, and very carefully terminating the trace
> immediately after the
> completion of the Payroll run. The key here is that
> every second of data
> in the trace file was a second of somebody's
> response time.
> 
> The result was a really fun story. The thing ran in
> about 33 minutes.
> Here's a summary of what was in the session's trace
> file:
> 
> Oracle Kernel Event                      Duration   
> Calls        Avg
> 
> ------------------------------ ------------------
> -------- ----------
> SQL*Net message from client       984.01s   49.6%  
> 95,161  0.010340s
> SQL*Net more data from client     418.82s   21.1%   
> 3,345  0.125208s
> db file sequential read           279.54s   14.1%  
> 45,085  0.006200s
> CPU service                       248.69s   12.5% 
> 222,760  0.001116s
> unaccounted-for                    27.73s    1.4%   
> 
> latch free                         23.69s    1.2%  
> 34,695  0.000683s
> log file sync                       1.09s    0.1%   
>   506  0.002154s
> SQL*Net more data to client         0.83s    0.0%  
> 15,982  0.000052s
> log file switch completion          0.28s    0.0%   
>     3  0.093333s
> enqueue                             0.25s    0.0%   
>   106  0.002358s
> SQL*Net message to client           0.24s    0.0%  
> 95,161  0.000003s
> buffer busy waits                   0.22s    0.0%   
>    67  0.003284s
> db file scattered read              0.01s    0.0%   
>     2  0.005000s
> SQL*Net break/reset to client       0.00s    0.0%   
>     2  0.000000s
> ------------------------------ ------------------
> -------- ----------
> Total                           1,985.40s  100.0%  
> 
> Well, waits for "latch free" accounted for only 1.2%
> of the total
> response time. If the DBAs had been completely
> successful in eliminating
> "latch free" waits from their system, it would have
> made only a 1.2%
> difference in the runtime of this program. Point 1:
> <emphasis>This
> program's bottleneck is not the same as its system's
> system-wide average
> bottleneck.</emphasis> This happens all the time.
> You can NOT see this
> type of problem when you look only at system-wide
> average statistics.
> 
> You probably didn't pay much heed to the first thing
> you should have
> noticed: slightly more than 70% of this program's
> runtime was consumed
> by so-called "server idle" time. ??! You can't
> ignore this time, even if
> people do call it "idle." Idle or not, this time was
> part of someone's
> response time, so we need to deal with it. Now if we
> hadn't collected
> our statistics so carefully (with proper time scope
> and proper program
> scope), then we would have seen probably much more
> "SQL*Net message from
> client" time in our data. If you make that
> particular collection error,
> then you *have* to disregard the so-called "idle"
> wait time.
> 
> What was the Payroll program's problem? A couple of
> things. First, we
> knew that the Payroll program was running on the
> same host as Oracle.
> But the latency per call on the "SQL*Net message
> from client" event
> looked suspiciously LAN-like (order of 10ms), not
> IPC-like (order of 1ms
> or less). So we checked the tnsnames.ora file. Turns
> out that to
> conserve administration effort, the DBAs had decided
> to use a single
> tnsnames.ora file system-wide. So the Payroll
> program was using the same
> TCP/IP protocol that the system's PC clients were
> using. Just by
> switching the protocol to BEQ instead of TCP in the
> db server's
> tnsnames.ora file (a change that took about 15
> minutes to test, 20
> seconds to implement, and about a week to navigate
> through change
> control), we eliminated about 50% of the program's
> runtime. 
> 
> Next, what's with the "SQL*Net more data from
> client"? The developers of
> this particular Payroll program did a pretty bad
> thing in their code:
> they passed several really long (10KB) SQL texts
> from the client to the
> server in their parse calls (oh yeh, and with no
> bind variables either).
> Anyway, each time the Oracle kernel received a
> packet containing only
> part of a SQL statement, it tallied to "SQL*Net more
> data from client"
> while it waited for the next piece. Of course,
> communicating via the
> network interface card instead of IPC didn't help,
> but passing huge SQL
> statements from the app to the db is bad form to
> begin with (see Ixora
> for more detail on just how bad this is).
> 
> Finally, realize that Oracle tallies to "SQL*Net
> message from client"
> pretty much every time it's "between" working on db
> call fulfillment.
> Therefore, there were on the order of 95,161 db
> calls during this
> 33-minute period. Looking at the SQL, we discovered
> that of course,
> since the app developers didn't use bind variables,
> there was a parse
> call for almost every execute. This accounted for
> thousands of
> unnecessary parse calls. Next, each execute call
> manipulated only zero
> or one row, which added up to dozens of execute
> calls for each single
> SQL statement with a multi-row result source. The
> program could have
> performed the business function with a fraction of
> the db calls.
> 
> Oh yes... Why did the Payroll program get *slower*
> after the upgrade?
> Little Payroll program time was spent using CPU, so
> the upgrade had very
> little direct positive effect upon the program. Most
> of the program's
> time was spent waiting for network latency. Other
> programs ran at the
> same time as this Payroll job. The CPU upgrade made
> *them* faster, which
> intensified *their* constant number of network calls
> into a smaller time
> window. The result was increased competition for the
> network during the
> 
=== message truncated ===


__________________________________________________
Do You Yahoo!?
LAUNCH - Your Yahoo! Music Experience
http://launch.yahoo.com
-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Jack Silvey
  INET: [EMAIL PROTECTED]

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

RE: I/O EVENTS

Reply via email to