yep, that's the one
--- Mohammad Rafiq <[EMAIL PROTECTED]> wrote:

> I think Rachel is referring to the following email from Cary Millsap.
> It is a really nice explanation of this issue....
> Regards
> Rafiq
>
> I have an example for you (Anjo, I hope you won't mind). A prospect we
> visited once upon a time had been fighting a performance problem with
> an Oracle Payroll program. They "knew" what their problem was: very
> clearly, v$system_event was telling them that their overwhelmingly
> dominant system performance problem was waits for "latch free."
> (Well, no it wasn't. It was waits for the Oracle Payroll program. ;) )
>
> Actually, the v$system_event view indicated clearly that the top wait
> event was "SQL*Net message from client" (and some other "SQL*Net %"
> stuff), but of course every good DBA knows that you have to discard
> all the "SQL*Net" events because they are "idle" events.
>
> Three months of trying and failing to fix the problem led finally to a
> last-resort, desperation hardware upgrade. Twelve 700MHz CPUs became
> twelve new 1,024MHz CPUs. (Don't ask why "waits for latch free" was
> translated as "need to upgrade CPUs.") And unfortunately, the Payroll
> program got *slower*. Not just "seemed slower"--*was* slower.
>
> After the upgrade failed to help, these guys finally let us come see
> them. We traced the Payroll process (10046, level 8), very carefully
> initiating the trace immediately after the initiation of the Payroll
> run, and very carefully terminating the trace immediately after the
> completion of the Payroll run. The key here is that every second of
> data in the trace file was a second of somebody's response time.
>
> The result was a really fun story. The thing ran in about 33 minutes.
> Here's a summary of what was in the session's trace file:
>
> Oracle Kernel Event                     Duration           Calls        Avg
> ------------------------------  -----------------  ---------  ---------
> SQL*Net message from client       984.01s   49.6%     95,161  0.010340s
> SQL*Net more data from client     418.82s   21.1%      3,345  0.125208s
> db file sequential read           279.54s   14.1%     45,085  0.006200s
> CPU service                       248.69s   12.5%    222,760  0.001116s
> unaccounted-for                    27.73s    1.4%
> latch free                         23.69s    1.2%     34,695  0.000683s
> log file sync                       1.09s    0.1%        506  0.002154s
> SQL*Net more data to client         0.83s    0.0%     15,982  0.000052s
> log file switch completion          0.28s    0.0%          3  0.093333s
> enqueue                             0.25s    0.0%        106  0.002358s
> SQL*Net message to client           0.24s    0.0%     95,161  0.000003s
> buffer busy waits                   0.22s    0.0%         67  0.003284s
> db file scattered read              0.01s    0.0%          2  0.005000s
> SQL*Net break/reset to client       0.00s    0.0%          2  0.000000s
> ------------------------------  -----------------  ---------  ---------
> Total                           1,985.40s  100.0%
>
> Well, waits for "latch free" accounted for only 1.2% of the total
> response time. If the DBAs had been completely successful in
> eliminating "latch free" waits from their system, it would have made
> only a 1.2% difference in the runtime of this program. Point 1:
> <emphasis>This program's bottleneck is not the same as its system's
> system-wide average bottleneck.</emphasis> This happens all the time.
> You can NOT see this type of problem when you look only at system-wide
> average statistics.
>
> You probably didn't pay much heed to the first thing you should have
> noticed: slightly more than 70% of this program's runtime was consumed
> by so-called "server idle" time. ??! You can't ignore this time, even
> if people do call it "idle." Idle or not, this time was part of
> someone's response time, so we need to deal with it.
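[Editor's note: a profile like the one above is just an aggregation of the trace file's timed lines into totals, counts, and shares of total response time. A minimal sketch of that aggregation, in Python with made-up event names and durations (not the real Payroll trace, and not a real 10046 parser):

```python
from collections import defaultdict

def profile(events):
    """Aggregate (event_name, elapsed_seconds) pairs into a response-time
    profile: total duration, percent of total response time, call count."""
    totals, calls = defaultdict(float), defaultdict(int)
    for name, elapsed in events:
        totals[name] += elapsed
        calls[name] += 1
    grand = sum(totals.values())
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(name, totals[name], 100.0 * totals[name] / grand, calls[name])
            for name in ranked]

# Toy data standing in for parsed trace lines.
sample = [("SQL*Net message from client", 0.010)] * 3 + \
         [("db file sequential read", 0.006)] * 2 + \
         [("CPU service", 0.001)]

for name, dur, pct, n in profile(sample):
    print(f"{name:30s} {dur:8.3f}s {pct:5.1f}% {n:6d}")
```

The point of ranking by total duration is exactly Cary's Point 1: the profile shows where *this* session's seconds went, regardless of what the system-wide averages say.]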
> Now if we hadn't collected our statistics so carefully (with proper
> time scope and proper program scope), then we would probably have seen
> much more "SQL*Net message from client" time in our data. If you make
> that particular collection error, then you *have* to disregard the
> so-called "idle" wait time.
>
> What was the Payroll program's problem? A couple of things. First, we
> knew that the Payroll program was running on the same host as Oracle.
> But the latency per call on the "SQL*Net message from client" event
> looked suspiciously LAN-like (order of 10ms), not IPC-like (order of
> 1ms or less). So we checked the tnsnames.ora file. Turns out that to
> conserve administration effort, the DBAs had decided to use a single
> tnsnames.ora file system-wide. So the Payroll program was using the
> same TCP/IP protocol that the system's PC clients were using. Just by
> switching the protocol to BEQ instead of TCP in the db server's
> tnsnames.ora file (a change that took about 15 minutes to test, 20
> seconds to implement, and about a week to navigate through change
> control), we eliminated about 50% of the program's runtime.
>
> Next, what's with the "SQL*Net more data from client"? The developers
> of this particular Payroll program did a pretty bad thing in their
> code: they passed several really long (10KB) SQL texts from the client
> to the server in their parse calls (oh yeh, and with no bind variables
> either). Anyway, each time the Oracle kernel received a packet
> containing only part of a SQL statement, it tallied to "SQL*Net more
> data from client" while it waited for the next piece. Of course,
> communicating via the network interface card instead of IPC didn't
> help, but passing huge SQL statements from the app to the db is bad
> form to begin with (see Ixora for more detail on just how bad this is).
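[Editor's note: the bind-variable point can be illustrated with a toy model in pure Python (not Oracle client code): if a shared SQL cache is keyed on statement text, literal values make every statement text distinct and force a parse per execute, while a bound placeholder is parsed once and reused.

```python
class ToyCursorCache:
    """Toy model of a shared SQL cache keyed on statement text: each
    distinct text costs one parse; repeated texts hit the cache."""
    def __init__(self):
        self.cache = set()
        self.parse_calls = 0
        self.execute_calls = 0

    def execute(self, sql, binds=None):
        if sql not in self.cache:   # new statement text: pay a parse
            self.parse_calls += 1
            self.cache.add(sql)
        self.execute_calls += 1

emp_ids = range(1000)

literal = ToyCursorCache()
for emp in emp_ids:  # literals: every statement text is distinct
    literal.execute(f"UPDATE pay SET net = net WHERE emp_id = {emp}")

bound = ToyCursorCache()
for emp in emp_ids:  # one shared text; the value travels as a bind
    bound.execute("UPDATE pay SET net = net WHERE emp_id = :emp",
                  binds={"emp": emp})

print(literal.parse_calls, bound.parse_calls)  # 1000 vs 1
```

The same 1000 executions cost 1000 parses in the literal version and one parse in the bound version, which is the "parse call for almost every execute" pattern described below.]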
> Finally, realize that Oracle tallies to "SQL*Net message from client"
> pretty much every time it's "between" working on db call fulfillment.
> Therefore, there were on the order of 95,161 db calls during this
> 33-minute period. Looking at the SQL, we discovered that of course,
> since the app developers didn't use bind variables, there was a parse
> call for almost every execute. This accounted for thousands of
> unnecessary parse calls. Next, each execute call manipulated only zero
> or one row, which added up to dozens of execute calls for each single
> SQL statement with a multi-row result source. The program could have
> performed the business function with a fraction of the db calls.
>
> Oh yes... Why did the Payroll program get *slower* after the upgrade?
> Little Payroll program time was spent using CPU, so the upgrade had
> very little direct positive effect upon the program. Most of the
> program's time was spent waiting for network latency. Other programs
> ran at the same time as this Payroll job. The CPU upgrade made *them*
> faster, which intensified *their* constant number of network calls
> into a smaller time window. The result was increased competition for
> the network during the Payroll run. Therefore, every network I/O call
> the Payroll program made was a little slower than before the CPU
> upgrade. The increase in network response time overwhelmed the tiny
> direct improvement of the CPU time reduction, resulting in a net
> degradation of Payroll performance. ...Not a good thing, because this
> Payroll program had a higher business priority than everything else on
> the system at the same time.
>
> The End.
>
> * * *
>
> So, some wrap-up:
>
> 1. Looking at v$session_event is better than looking at
> v$system_event for these things. When you collect system-wide data to
> solve a single program's performance problem, your irrelevant data can
> bury your relevant data. We see this all the time.
> Any time you query v$system_event and ask about the top-line events
> you see, you're vulnerable to this problem.
>
> 2. Furthermore, looking even at v$session_event tells you NOTHING
> about the user's experience. When a v$ query tells you that there's
> "SQL*Net message from client" time, you can't know without more data
> whether this stuff showed up in someone's response time, or if it
> corresponded to time when the USER was actually idle.
>
> 3. Another important reason that looking at v$session_event data can
> mislead you is that there's no record of CPU time in there. To find
> that, you have to go to something like v$sesstat. The "CPU used by
> this session" statistic is supposed to give you the CPU time you need,
> but various bugs prevent you from getting what you need in many cases.
>
> 4. Response time = CPU from sesstat + waits from session_event + an
> unaccounted-for gap in time. Lots of people believe that the "gap" is
> a deficiency of 10046 data: it's not. The irony is that the same
> deficiency is even worse in v$ table methods; you just can't tell that
> there even *is* a gap: you can't compute gap = R - (cpu + waits),
> because with v$-based methods you don't know what R is. This "gap" is
> one of those things that we've discovered how to interpret reliably
> after some work. For more information, see some of our Hotsos Profiler
> documentation at www.hotsos.com/dnloads/1/kevents and click
> "unaccounted-for" (the document is free, but you have to register).
> (Bear with us, most of the event documentation on our kevents page is
> pre-beta, but the "unaccounted-for" document is pretty much complete.)

=== message truncated ===
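[Editor's note: point 4's accounting identity can be checked against the Payroll profile earlier in the message. A minimal sketch of the arithmetic in Python (an illustration only, not Hotsos's tool), using the wait and CPU figures from that table:

```python
def unaccounted_for(response_time, cpu_time, wait_times):
    """gap = R - (CPU + sum of wait-event time): the residue that a
    v$-based method cannot even detect, because it never knows R."""
    return response_time - (cpu_time + sum(wait_times))

# Wait-event durations from the Payroll profile, in seconds.
waits = [984.01, 418.82, 279.54, 23.69, 1.09, 0.83, 0.28,
         0.25, 0.24, 0.22, 0.01, 0.00]

gap = unaccounted_for(response_time=1985.40, cpu_time=248.69,
                      wait_times=waits)
print(f"{gap:.2f}s")  # matches the 27.73s "unaccounted-for" line
```

With trace data you know R (the trace covers the whole run), so the gap is computable; with v$ snapshots you have only the cpu and waits terms, which is exactly why the gap is invisible there.]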
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Rachel Carmichael
  INET: [EMAIL PROTECTED]
